
Faculty of Cognitive Sciences and Human Development

PHYLOGENETIC TREE CLASSIFICATION SYSTEM USING MACHINE LEARNING ALGORITHM

Tan Jia Kae

Bachelor of Science with Honours (Cognitive Science)

2015

UNIVERSITI MALAYSIA SARAWAK

Grade: _____

Please tick one:

Final Year Project Report [x]   Masters [ ]   PhD [ ]

DECLARATION OF ORIGINAL WORK

This declaration is made on the 5th day of June 2015.

Student's Declaration:

I, TAN JIA KAE (39023), FACULTY OF COGNITIVE SCIENCES AND HUMAN DEVELOPMENT, hereby declare that the work entitled PHYLOGENETIC TREE CLASSIFICATION SYSTEM USING MACHINE LEARNING ALGORITHM is my original work. I have not copied from any other students' work or from any other sources, except where due reference or acknowledgement is made explicitly in the text, nor has any part of the work been written for me by another person.

5 JUNE 2015

TAN JIA KAE (39023)

Supervisor's Declaration:

I, DR LEE NUNG KION, hereby certify that the work entitled PHYLOGENETIC TREE CLASSIFICATION SYSTEM USING MACHINE LEARNING ALGORITHM was prepared by the above-mentioned student and was submitted to the FACULTY as a partial/full fulfillment for the conferment of BACHELOR OF SCIENCE WITH HONOURS (COGNITIVE SCIENCE), and that the aforementioned work, to the best of my knowledge, is the said student's work.

Received for examination by: ________   Date: 5 JUNE 2015

I declare this Project/Thesis is classified as (please tick):

[ ] CONFIDENTIAL (Contains confidential information under the Official Secret Act 1972)

[ ] RESTRICTED (Contains restricted information as specified by the organisation where research was done)

[x] OPEN ACCESS

I declare this Project/Thesis is to be submitted to the Centre for Academic Information Services (CAIS) and uploaded into the UNIMAS Institutional Repository (UNIMAS IR) (please tick):

[x] YES

[ ] NO

Validation of Project/Thesis

I hereby duly affirm, with free consent and willingness, that this Project/Thesis shall be placed officially in the Centre for Academic Information Services with the abiding interest and rights as follows:

• This Project/Thesis is the sole legal property of Universiti Malaysia Sarawak (UNIMAS).

• The Centre for Academic Information Services has the lawful right to make copies of the Project/Thesis for academic and research purposes only and not for other purposes.

• The Centre for Academic Information Services has the lawful right to digitize the content to be uploaded into the Local Content Database.

• The Centre for Academic Information Services has the lawful right to make copies of the Project/Thesis, if required, for use by other parties for academic purposes or by other Higher Learning Institutes.

• No dispute or claim shall arise from the student himself/herself, nor from a third party, on this Project/Thesis once it becomes the sole property of UNIMAS.

• This Project/Thesis, or any material, data and information related to it, shall not be distributed, published or disclosed to any party by the student himself/herself without first obtaining approval from UNIMAS.

Student's signature: ________   Date: ________

Supervisor's signature: ________   Date: 5 June 2015

Current Address: Universiti Malaysia Sarawak, 94300 Kota Samarahan, Sarawak

Notes: If the Project/Thesis is CONFIDENTIAL or RESTRICTED, please attach as an annexure a letter from the organisation with the date of restriction indicated and the reasons for the confidentiality and restriction.

PHYLOGENETIC TREE CLASSIFICATION SYSTEM BY USING MACHINE LEARNING ALGORITHM

TAN JIA KAE

This project is submitted in partial fulfilment of the requirements for a

Bachelor of Science with Honours (Cognitive Science)

Faculty of Cognitive Sciences and Human Development
UNIVERSITI MALAYSIA SARAWAK

(2015)

The project entitled "Phylogenetic tree classification system by using machine learning algorithm" was prepared by Tan Jia Kae and submitted to the Faculty of Cognitive Sciences and Human Development in partial fulfilment of the requirements for a Bachelor of Science with Honours (Cognitive Science).

Received for examination by:

________________________
(Dr Lee Nung Kion)

Date: 5 June 2015

Grade:

ACKNOWLEDGEMENTS

First and foremost, I would like to take this opportunity to express my deepest appreciation to my supervisor, Dr Lee Nung Kion, for his generosity and patience in spending his precious time giving me many remarks, as well as sharing his superior knowledge, experience and expertise while I completed my Final Year Project. Without his guidance, my project could not have been completed successfully in the limited time available.

Next, I am deeply indebted to my family for their unceasing encouragement, support and attention throughout my Final Year Project, especially during the periods when I really needed their love to help me finish my Final Year Project thesis.

In addition, I would like to thank all my friends and course mates who supported and encouraged me in the completion of this project. During the project I faced difficulties that tempted me to give up. Luckily, they gave me plenty of advice and support, which gave me the strength and confidence to finish my Final Year Project thesis.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
ABSTRAK
CHAPTER ONE: INTRODUCTION
CHAPTER TWO: LITERATURE REVIEW
CHAPTER THREE: METHODOLOGY
CHAPTER FOUR: RESULT AND DISCUSSION
CHAPTER FIVE: CONCLUSION AND RECOMMENDATION
REFERENCE
APPENDIX A: PHYLOGENETIC TREE CLASSIFICATION SYSTEM MATLAB CODING

LIST OF TABLES

Table 1: Phylogenetic tree classification cross-validation results based on different features
Table 2: 10-fold cross-validation results with 540 training data and 60 testing data in each fold

LIST OF FIGURES

Figure 1: The first evolution tree diagram sketched by Darwin
Figure 2: Cumulative number of publications in the Science Citation Index since 1981 that cite the terms "molecular" and "phylogeny" in the keywords or abstract
Figure 3: Non-phylogenetic tree: family tree
Figure 4: Phylogenetic rooted tree: rectangular cladogram
Figure 5: Phylogenetic rooted tree: slanted diagram
Figure 6: Phylogenetic unrooted tree: circular cladogram
Figure 7: Phylogenetic scaled tree
Figure 8: Phylogenetic unscaled tree
Figure 9: A quick review of phylogenetic tree
Figure 10: Object detection in computer perception
Figure 11: Feature representation
Figure 12: SIFT
Figure 13: RIFT
Figure 14: Spin image
Figure 15: Pre-processing of model objects
Figure 16: Recognition of object in the scene
Figure 17: TreeRipper
Figure 18: TreeSnatcher Plus
Figure 19: Windows Snipping Toolbox
Figure 20: Original 1.png
Figure 21: After thresholding
Figure 22: Grayscale image
Figure 23: SURF feature detection and extraction
Figure 24: GIST feature detection and extraction
Figure 25: 10-fold cross-validation accuracy
Figure 26: Example of Graphic User Interface for the Phylogenetic Tree Image Classification system
Figure 27: Graphic User Interface for the Phylogenetic Tree Image Classification system

ABSTRACT

A study was conducted to develop an automated phylogenetic tree image classification system using a machine learning algorithm. The study adopted a supervised machine learning algorithm, the Support Vector Machine (SVM), for classification. Image data were collected from the online databases PUBMED and ScienceDirect and from Bioinformatics journals. Performance comparisons of three types of features for characterizing phylogenetic tree images are presented in this project. The aim is to determine which features are suitable for the phylogenetic tree image classification system. Leave-one-out cross-validation was used to calculate the accuracy of each feature. In addition, 10-fold cross-validation was conducted in the evaluation. Our results show that a suitable combination of features for the phylogenetic tree image classification system is SIFT, SURF and GIST. The accuracy obtained from the combination of these three features is just over 82%. The average accuracy obtained from the 10-fold cross-validation is 81.50%. Our evaluation results demonstrate the utility of SIFT, SURF and GIST features for building a phylogenetic tree image classification system.

Keywords: phylogenetic tree, image classification system, image processing, feature extraction, SIFT, GIST, SURF

ABSTRAK

A study was conducted to produce an automated phylogenetic tree image classification system using a machine learning algorithm. The study used a supervised machine learning algorithm, the Support Vector Machine (SVM). Image data were collected from the online databases PUBMED, ScienceDirect and Bioinformatics. A comparison of the performance of three different phylogenetic tree features is also presented in this project. The aim is to determine the features suitable for the phylogenetic tree image classification system. Leave-one-out cross-validation was used to calculate the accuracy of each feature. In addition, 10-fold cross-validation was measured in this study. The results show that the most suitable combination of features for the phylogenetic tree image classification system is SIFT, SURF and GIST. The accuracy obtained by combining the three features exceeds 82.19%. The results also show that the average accuracy obtained from 10-fold cross-validation is 81.50%. The results of this study support the combination of SIFT, SURF and GIST features for implementing this phylogenetic tree classification system.

Keywords: image classification system, image processing, feature extraction, SIFT, GIST, SURF

CHAPTER ONE

INTRODUCTION

Overview

Phylogenetic trees are widely used for the evolutionary analysis of different species, organisms or genes descended from a common ancestor (Laubach, von Haeseler, & Lercher, 2012). According to Brinkman (2005), evolution analysis is a collection of methods for ascertaining long-term phenotypic evolution that was developed during the 1990s. Evolutionary analysis also draws on evolution theory, which is the foundation of most bioinformatic analysis. This is because evolutionary analysis characterizes species ecologically using the concept of frequency dependence from gene theory (Brinkman, 2005). This chapter discusses the background of the study, the problem statement, research objectives, research questions, hypothesis, conceptual framework and significance of the study. It also defines the relevant terms.

Introduction

The evolutionary tree, or phylogenetic tree, is a visualization that shows the relationships among entities according to the similarities and differences in their hereditary or physical characteristics (Baum, 2008). The way a phylogenetic tree shows the relationships among species is therefore also important, as reflected in how the tree presents the evolutionary analysis of any species. Evolution analysis generally includes the identification of homologous sequences, their multiple alignment, phylogenetic reconstruction, and the graphical representation of the inferred tree (Dereeper et al., 2008). These four steps can be explained in terms of biological evolution. According to Dereeper et al. (2008), the first step identifies similar sequences, whereas the alignment step determines where those sequences differ. Phylogenetic reconstruction then builds the phylogenetic tree from the aligned sequences, and the graphical representation shows the relationship between the species in the tree (Dereeper et al., 2008). This reflects the increasing use of phylogenetic trees in the biological sciences, especially by biologists who analyse the evolution of species. The phylogenetic tree is therefore quite important for the evolutionary analysis of life on Earth.

Phylogeny is the evolutionary history of a species or group of related species (Pagel, 1999). It underlies systematics, the discipline that classifies organisms (Siegel-Causey, Brooks, & Funk, 1991), because systematists use phylogeny to determine the evolutionary relationships of organisms. According to Campbell and Reece (2008), a systematist is a professional who uses fossil, molecular and genetic data to infer evolutionary relationships. They also proposed PhyloCode, which can be used to depict a phylogenetic analysis as a branching phylogenetic tree. A phylogenetic analysis is presented as a collection of nodes and branches. Taxa that are closely related in an evolutionary sense appear close to each other, whereas taxa that are distantly related sit on different branches of the tree, far from each other.
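The idea that closely related taxa sit near each other in the tree can be made concrete by counting the branches on the path between two taxa. The following is a minimal sketch, not part of the original study: the toy tree, the taxon names and the helper names `build_adjacency` and `branch_distance` are all illustrative assumptions.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (node, node) branches."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def branch_distance(adj, start, goal):
    """Count the branches on the path between two nodes (breadth-first search)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return dist[node]
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    raise ValueError("the two taxa are not connected")


# Hypothetical toy tree: n1, n2 and n3 are internal nodes, the rest are taxa.
tree = build_adjacency([
    ("human", "n1"), ("chimp", "n1"),
    ("n1", "n2"), ("gorilla", "n2"),
    ("n2", "n3"), ("orangutan", "n3"), ("gibbon", "n3"),
])

print(branch_distance(tree, "human", "chimp"))   # 2 branches apart
print(branch_distance(tree, "human", "gibbon"))  # 4 branches apart
```

In this toy example the two taxa that share the internal node `n1` are two branches apart, while taxa attached to different parts of the tree are separated by more branches. Branch counts ignore branch lengths, so this is only a rough proxy for relatedness.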

Background of the study

In 1859, Darwin published the first illustration of a phylogenetic tree (Darwin, 1859). Before that, in 1837, shortly after his famous five-year voyage as a naturalist on the Beagle, he sketched a tree diagram in his notebook (Darwin, 1859). As Figure 1 shows, the simple sketch was remarkably similar to modern diagrams of phylogenies (Darwin, 1987).

Figure 1. The first evolution tree diagram sketched by Darwin. Adapted from Charles Darwin's notebooks, 1836-1844: Geology, transmutation of species, metaphysical enquiries (p. 87), by Darwin, C., 1987, Cambridge: Cambridge University Press. Copyright 1987 by P. H. Barrett (Ed.). Adapted with permission.

Figure 2. Cumulative number of publications in the Science Citation Index since 1981 that cite the terms "molecular" and "phylogeny" in the keywords or abstract (horizontal axis: year, 1980 to 2000). Adapted from "Inferring the historical patterns of biological evolution", by Pagel, M., 1999, Nature, 401(6756), p. 844. Copyright 1999 by Pagel. Adapted with permission.

The first illustration of a phylogenetic tree accompanied the first scientific argument for the theory of evolution by means of natural selection. Darwin (1998) stated, "The time will come, I believe, though I shall not live to see it, when we shall have fairly true genealogical trees of each great kingdom of Nature" (p. 18). He would have been eager to see how modern genetics has supported and confirmed his ideas. He provided evidence not only for what had happened in evolution but also for precisely how living things evolve (Darwin, 1859). Today, DNA supplies forensic evidence for evolution.

In fact, several approaches were used to study the evolution of species before molecular phylogenetics (Campbell & Reece, 2008). In the 1900s, immunochemical studies were used to detect cross-reactions, which are stronger between closely related organisms. Between the 1940s and the 1960s, biologists used protein sequencing, electrophoresis, DNA hybridization and PCR, which contributed to a boom in molecular phylogeny. After the publication of The Origin of Species, many other biologists came to accept the idea of a universal Tree of Life (Darwin, 1987). One example is the German biologist Ernst Haeckel, who supported Darwin's Tree of Life (Larget, 2011). Then, in the late 1970s, biologists began to analyse the evolution of organisms using molecular phylogeny. Phylogenetic trees are very useful to biologists because they can be used to describe the relationships between living creatures, genomes and genes.

With the development of phylogenetic techniques, the number of studies depicting phylogenies has exploded (Pagel, 1999). The number of articles publishing phylogenies based on gene-sequence information has been increasing exponentially, as Figure 2 shows (Pagel, 1999). These phylogenies cover taxonomic groups ranging from viruses to bacteria, fungi, plants and animals (Campbell & Reece, 2008). The phylogenetic tree has thus become popular and important for the evolutionary analysis of organisms. The phylogenetic tree is a branching diagram that shows the evolutionary relationships of organisms (Baum, 2008). According to Darwin (1859), evolution is a natural process that acts on populations. It can be described as the change in the hereditary traits of biological populations over successive generations.

Phylogeny can also show similarities and differences in physical and hereditary traits. This is because taxa that attach to the same node of the tree are thereby indicated to descend from that node (Gregory, 2008). In this sense a phylogenetic tree is similar to a family tree. Moreover, phylogenetic trees are constructed from similarities or differences in physical or genetic features. In the past, scientists used only the traditional approach, which focused on physical features, to construct phylogenetic trees. The advancement of technology has since led to the accumulation of huge amounts of biological data (Wan & Che, 2013), which may change the way biological studies are carried out in various respects.

As mentioned by Wan and Che (2013), phylogenetic trees can be built using information on interacting pathways. They applied hierarchical clustering to two domains of organisms, eukaryotes and prokaryotes. Using interacting pathways can increase the effectiveness of revealing the evolutionary relationships of species (Wan & Che, 2013).

A phylogenetic tree is constructed using a variety of evidence, most commonly a comparison of DNA (Kaizhong, Jason T., & Dennis, 1996). It is an undirected, acyclic, connected graph. Basically, the lengths of the branches represent the time since the groups split from each other, and the internal nodes of the tree represent ancestors. The exterior nodes are called leaves.
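The graph description above can be checked mechanically. The sketch below is only illustrative and is not taken from the thesis: it assumes a tree stored as an adjacency map, and the helper names `build_adjacency`, `is_tree` and `leaves` as well as the toy taxa are hypothetical. A graph is treated as a tree when it is connected and has exactly one branch fewer than it has nodes, and the leaves are the nodes with a single neighbour.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (node, node) branches."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def is_tree(adj):
    """Return True if the undirected graph is connected and acyclic."""
    nodes = list(adj)
    if not nodes:
        return False
    # Each branch appears twice in the adjacency map, once per endpoint.
    branch_count = sum(len(nbrs) for nbrs in adj.values()) // 2
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    connected = len(seen) == len(nodes)
    # A connected graph with n nodes is acyclic exactly when it has n - 1 branches.
    return connected and branch_count == len(nodes) - 1


def leaves(adj):
    """Return the exterior nodes, i.e. nodes with exactly one neighbour."""
    return sorted(node for node, nbrs in adj.items() if len(nbrs) == 1)


# Hypothetical toy tree: n1 and n2 are internal (ancestor) nodes.
branches = [("human", "n1"), ("chimp", "n1"), ("n1", "n2"),
            ("gorilla", "n2"), ("orangutan", "n2")]
toy_tree = build_adjacency(branches)
with_cycle = build_adjacency(branches + [("human", "chimp")])

print(is_tree(toy_tree))    # True
print(leaves(toy_tree))     # ['chimp', 'gorilla', 'human', 'orangutan']
print(is_tree(with_cycle))  # False, the extra branch closes a cycle
```

Branch lengths are left out here; they could be stored as weights on each branch if the time since divergence is needed.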

Apart from constructing phylogenetic trees, a newer approach is to extract phylogenetic tree data from the published literature using content mining (Mounce, 2012). The term can be explained in two parts. Content can include anything, such as audio, video, metadata, text and images. Mining refers to the large-scale extraction of information from that content. Extracting phylogenetic tree data from the literature relies on content mining rather than text mining alone because the content is more than just text (Mounce, 2012).

In short, phylogenetic trees provide a framework that shows the evolution of features (Baum, 2008), since related species share many similar features. Phylogenetic trees are also used in bioprospecting, where an optimal strategy exploits phylogenetic information to target closely related species in the search for a shared feature of interest (Kelly, Grenyer, & Scotland, 2014). Phylogenetic trees are likewise useful in conservation evaluation for choosing sets of species that maximize the present utilitarian benefits of extant feature diversity as well as the range of evolutionary trajectories in the future.

Problem Statement of the study

With the increasing volume of publication databases, the number of published phylogenetic trees is growing. This is because, with the rapid accumulation of DNA sequence data, more and more phylogenetic trees are being constructed (Pagel, 1999). This makes it challenging and time-consuming for a researcher to search for relevant information (Dereeper et al., 2008). Moreover, the contents of these published documents vary and include images, audio, art and tables. Search engines rely on the text or captions associated with a figure to perform a search. Classifying phylogenetic tree images one by one is therefore challenging and wasteful of the researcher's time. If biologists find searching for a particular phylogenetic tree difficult and time-consuming, their research may be delayed. Furthermore, phylogenetic trees were devised to study the evolution of organisms, and nowadays published trees are mainly reused by other biologists. An automated digitization application that searches for phylogenetic trees is therefore truly needed, because it can replace a very challenging manual task by determining whether an image is a phylogenetic tree.

The main purpose of this project is therefore the automated digitization of phylogenetic tree image classification using a machine learning algorithm. The classification focuses on deciding whether images in a PDF or text file are phylogenetic trees or non-phylogenetic trees. Examples of phylogenetic trees are the cladogram, the phenogram and tree terminology. Examples of non-phylogenetic trees are the family tree, the life cycle of organisms and the flow chart. Figure 3 shows an example of a non-phylogenetic tree, a family tree (Murdoch, 2013).

Figure 3. Non-phylogenetic tree: family tree. Adapted from Murdoch Family Tree, by Murdoch, W., 2013. Retrieved from http://www.williammurdoch.net/articles_12_Murdoch_family_tree.html. Copyright 2013 by Murdoch. Adapted with permission.

General Objective

The main objective of this research study is to employ a machine learning algorithm that can classify images as phylogenetic trees or non-phylogenetic trees.

Specific Objective

The specific objectives of this study are:

i. To employ machine learning that can predict whether an image represents a phylogenetic tree.

ii. To compare and contrast the different features that represent a phylogenetic tree in an image.

Research Question

i. Can a neural network be used for the prediction of phylogenetic tree images?

ii. Which discriminative features can be used for classifier learning?

Definition of Terms

i. A phylogenetic tree, also called a phylogeny, is a branching diagram that illustrates the lines of evolutionary relationship of different species, organisms or genes from a common ancestor (Baum, 2008).

ii. Phylogeny is the evolutionary relationship between organisms (Baum, 2008).

iii. Evolution analysis is the foundation of phylogenetic trees, to be applied with cautionary notes (Brinkman, 2005).

iv. Content mining is defined as the mining of content, a significant part of which is figure mining of non-textual content (Mounce, 2014).

This research study hopes to advance knowledge on the automated digitization of phylogenetic tree images, classifying images from PDF or text files as phylogenetic trees or non-phylogenetic trees.

This research study is mainly focused on the rooted tree (cladogram) and the unrooted tree.

In conclusion, phylogenetics is the science of constructing hypotheses about the evolutionary relationships of organisms in the form of phylogenetic trees. This project is not concerned with the reconstruction of phylogenetic trees. Rather, it uses a machine learning algorithm to classify images in PDF or text files as phylogenetic trees or non-phylogenetic trees.

CHAPTER TWO

LITERATURE REVIEW

As mentioned by Mounce (2012), millions of papers are now published each year, at an ever-growing rate, about the phylogenetic tree. This is because the amount and diversity of species with at least partial sequence information is rapidly increasing. Phylogenetic trees have thus become an integral part of various biological studies, with the exponential increase in sequence data generated by classical and next-generation sequencing studies (Baum, 2008). This chapter is divided into a few sections. The first section focuses on phylogenetic trees, explaining their meaning and purpose and the types of phylogenetic tree. The next section concentrates on image features, with emphasis on the features suitable for the image classification process. It also reviews image recognition system frameworks.

Phylogenetic Tree

A phylogenetic tree, or evolution tree, is an illustrative representation of biological entities that are associated through common descent, such as species or higher-level taxonomic groups (Gregory, 2008). The phylogenetic tree is a backbone for various other biological studies (Baum, 2008). It is a graphical representation of a hypothesis about the evolution of a species, with branches that separate, hybridize or terminate by extinction. Readers can therefore read and understand patterns of descent from phylogenetic trees. However, phylogenetic trees do not indicate when species evolved or how much genetic change occurred in a lineage (Kelly, Grenyer, & Scotland, 2014). Nor should it be assumed from a phylogenetic tree that a taxon evolved from the taxon next to it.

Baum, Smith, and Donovan (2005) stated that the phylogenetic tree is the most direct illustration of the principle of common ancestry, which makes it crucial for evolutionary theory. They stress that a practical understanding of what a phylogenetic tree represents is important for understanding the evolutionary relationships of species. Phylogenetic trees are thus important in the evolutionary analysis of any species, and biologists should make greater use of them in the biological sciences. Phylogenetic trees also provide an efficient structure for organizing biodiversity information and help develop an accurate conception of the totality of evolutionary history. It is therefore important for aspiring biologists to develop an understanding of phylogenetic trees.

Types of Phylogenetic Tree

Phylogenetic trees can be divided into different kinds. There are two main categories: rooted phylogenetic trees and unrooted phylogenetic trees. Apart from the two main categories, a phylogenetic tree can be drawn in several forms: the rectangular cladogram (Figure 4; Phylogenetic tree, 2002), the slanted cladogram (Figure 5; Phylogenetic tree, 2002) and the circular cladogram (Figure 6; Phylogenetic tree, 2002). Roots can be artificially added to unrooted trees by means of a species that had unambiguously separated early from the other species being considered (Bacardit, 2009).
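Adding a root by means of an early-diverging species (often called an outgroup) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the thesis: the unrooted tree is stored as an adjacency map, the new root is placed on the branch leading to the outgroup, and the names `build_adjacency`, `root_with_outgroup` and the taxa A, B, C and O are hypothetical.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (node, node) branches."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def root_with_outgroup(adj, outgroup):
    """Root an unrooted tree on the branch that leads to the outgroup leaf.

    Returns nested (node, children) tuples; a leaf is returned as its name.
    """
    if len(adj[outgroup]) != 1:
        raise ValueError("the outgroup must be a leaf of the unrooted tree")
    (attach,) = adj[outgroup]  # the single node the outgroup is attached to

    def subtree(node, parent):
        # Every neighbour except the one we came from becomes a child.
        children = [subtree(n, node) for n in sorted(adj[node]) if n != parent]
        return node if not children else (node, children)

    return ("root", [outgroup, subtree(attach, outgroup)])


# Hypothetical unrooted tree: x and y are internal nodes, O is the outgroup.
unrooted = build_adjacency([("A", "x"), ("B", "x"), ("x", "y"),
                            ("C", "y"), ("O", "y")])
rooted = root_with_outgroup(unrooted, "O")
print(rooted)
# ('root', ['O', ('y', ['C', ('x', ['A', 'B'])])])
```

The outgroup ends up as one child of the new root and the rest of the tree as the other, which is what gives the remaining branches a direction of descent.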

12

Page 2: Faculty of Cognitive Sciences and Human Development Tree Classification... · Figure 4: phylogenetic rooted-tree: rectangular cladogram ..... 13 Figure 5: phylogenetic rooted-tree:

UNIVERSITI MALAYSIA SARAWAK

Grade _____

Please tick one

Final Year Project Report IZI Masters D PhD D

DECLARATION OF ORIGINAL WORK

This declaration is made on the 05 day of JUNE year 2015

Students Declaration I TAN JIA KAE 39023 FACULTY OF COGNITIVE SCIENCES AND HUMAN DEVELOPMENT hereby declare that the work entitled PHYLOGENETIC TREE CLASSIFICATION SYSTEM USING MACHINE LEARNING ALGORITHM is my original work I have not copied from any other students work or from any other sources with the exception where due reference or acknowledgement is made explicitly in the text nor has any part of the work been written for me by another person

5 JUNE 2015

TAN JIA KAE (39023)

Supervisors Declaration I DR LEE NUNG KlON hereby certify that the work entitled PHYLOGENETIC TREE CLASSIFICATION SYSTEM USING MACHINE LEARNING ALGORITHM was prepared by the aforementioned or above mentioned student and was submitted to the FACULTY as a partiallfull fulfillment for the conferment of BACHELOR OF SCIENCE WITH HONOURS (COGNITIVE SCIENCE) and the aforementioned work to the best of my knowledge is the said students work ~

5 JUNE 2015 Date ________Received for examination by

I declare this ProjectThesis is classified as (Please tick (Jraquo

o CONFIDENTIAL (Contains confidential information under the Official Secret Act 1972)

o RESTRICTED (Contains restricted information as specified by the organisation where research was done)

~ OPEN ACCESS

I declare this ProjectThesis is to be submitted to the Centre for Academic Information Services (CAIS) and uploaded into UNIMAS Institutional Repository (UNlMAS IR) (Please tick 0raquo

~ YES

o NO

Validation of Project/Thesis

I hereby duly affirm, with free consent and willingness, that this said Project/Thesis shall be placed officially in the Centre for Academic Information Services with the abiding interest and rights as follows:

• This Project/Thesis is the sole legal property of Universiti Malaysia Sarawak (UNIMAS).
• The Centre for Academic Information Services has the lawful right to make copies of the Project/Thesis for academic and research purposes only and not for other purposes.
• The Centre for Academic Information Services has the lawful right to digitize the content to be uploaded into the Local Content Database.
• The Centre for Academic Information Services has the lawful right to make copies of the Project/Thesis, if required, for use by other parties for academic purposes or by other Higher Learning Institutes.
• No dispute or claim shall arise from the student himself/herself or from a third party on this Project/Thesis once it becomes the sole property of UNIMAS.
• This Project/Thesis, or any material, data and information related to it, shall not be distributed, published or disclosed to any party by the student himself/herself without first obtaining approval from UNIMAS.

Student's signature: ________ Date: ________

Supervisor's signature: ________ Date: 5 JUNE 2015

Current Address: Universiti Malaysia Sarawak, 94300 Kota Samarahan, Sarawak

Notes: If the Project/Thesis is CONFIDENTIAL or RESTRICTED, please attach together as annexure a letter from the organisation with the date of restriction indicated and the reasons for the confidentiality and restriction.

PHYLOGENETIC TREE CLASSIFICATION SYSTEM BY USING MACHINE LEARNING ALGORITHM

TAN JIA KAE

This project is submitted in partial fulfilment of the requirements for a Bachelor of Science with Honours (Cognitive Science)

Faculty of Cognitive Sciences and Human Development
UNIVERSITI MALAYSIA SARAWAK
(2015)

The project entitled "Phylogenetic tree classification system by using machine learning algorithm" was prepared by Tan Jia Kae and submitted to the Faculty of Cognitive Sciences and Human Development in partial fulfilment of the requirements for a Bachelor of Science with Honours (Cognitive Science).

Received for examination by:

(Dr Lee Nung Kion)

Date: 5 June 2015

Grade:

ACKNOWLEDGEMENTS

First and foremost, I would like to take this opportunity to express my deepest appreciation to my supervisor, Dr Lee Nung Kion, for his generosity and patience in spending his precious time giving me many remarks and sharing his superior knowledge, experience and expertise while I completed my Final Year Project. Without his guidance, my project would not have been completed successfully in the limited time available.

Next, I am deeply indebted to my family for their unceasing encouragement, support and attention during the whole process of my Final Year Project study, especially during those periods when I really needed their love to help me finish my Final Year Project thesis.

In addition, I would like to thank all my friends and course mates who supported and encouraged me in the completion of this project. During the project I faced some difficulties that tempted me to give up. Luckily, they gave me plenty of advice and support, which gave me the strength and confidence to finish my Final Year Project thesis.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
ABSTRAK
CHAPTER ONE: INTRODUCTION
CHAPTER TWO: LITERATURE REVIEW
CHAPTER THREE: METHODOLOGY
CHAPTER FOUR: RESULT AND DISCUSSION
CHAPTER FIVE: CONCLUSION AND RECOMMENDATION
REFERENCE
APPENDIX A: PHYLOGENETIC TREE CLASSIFICATION SYSTEM MATLAB CODING

LIST OF TABLES

Table 1: Phylogenetic tree classification cross-validation results based on different features
Table 2: 10-fold cross-validation results with 540 training data and 60 testing data in each fold

LIST OF FIGURES

Figure 1: The first evolution tree diagram sketched by Darwin
Figure 2: Cumulative number of publications in the Science Citation Index since 1981 that cite the terms molecular and phylogeny in the keywords or abstract
Figure 3: Non-phylogenetic tree: family tree
Figure 4: Phylogenetic rooted tree: rectangular cladogram
Figure 5: Phylogenetic rooted tree: slanted diagram
Figure 6: Phylogenetic unrooted tree: circular cladogram
Figure 7: Phylogenetic scaled tree
Figure 8: Phylogenetic unscaled tree
Figure 9: A quick review of phylogenetic tree
Figure 10: Object detection in computer perception
Figure 11: Feature representation
Figure 12: SIFT
Figure 13: RIFT
Figure 14: Spin image
Figure 15: Pre-processing of model objects
Figure 16: Recognition of object in the scene
Figure 17: TreeRipper
Figure 18: TreeSnatcher Plus
Figure 19: Windows Snipping Toolbox
Figure 20: Original 1.png
Figure 21: After thresholding
Figure 22: Grayscale image
Figure 23: SURF feature detection and extraction
Figure 24: GIST feature detection and extraction
Figure 25: 10-fold cross-validation accuracy
Figure 26: Example of Graphic User Interface for the Phylogenetic Tree Image Classification system
Figure 27: Graphic User Interface for the Phylogenetic Tree Image Classification system

ABSTRACT

A study was conducted to develop an automated phylogenetic tree image classification system using a machine learning algorithm. This study adopted a supervised machine learning algorithm, the Support Vector Machine (SVM), for classification. Image data were collected from the online databases PubMed and ScienceDirect and from bioinformatics journals. Performance comparisons of three types of features used to characterize the phylogenetic tree images are presented in this project. The aim is to determine suitable features for the phylogenetic tree image classification system. Leave-one-out cross-validation was used to calculate the accuracy of each feature. In addition, 10-fold cross-validation was conducted in the evaluation. Our results show that the suitable feature combination for the phylogenetic tree image classification system is SIFT, SURF and GIST. The accuracy obtained from the combination of these three features is just over 82%. Furthermore, the average accuracy obtained from the 10-fold cross-validation is 81.50%. Our evaluation results demonstrate the utility of SIFT, SURF and GIST features for building a phylogenetic tree image classification system.

Keywords: phylogenetic tree, image classification system, image processing, feature extraction, SIFT, GIST, SURF
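
The two evaluation protocols named in the abstract can be illustrated with a minimal Python sketch. It assumes a data set of 600 images, a figure inferred from the 540 training and 60 testing images per fold listed for Table 2. The function name `k_fold_indices` and the fixed random seed are hypothetical choices made for illustration; they are not taken from the MATLAB code in Appendix A.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # Spread any remainder over the first folds so sizes differ by at most 1.
        stop = start + fold_size + (1 if fold < remainder else 0)
        test_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, test_idx
        start = stop


# Assumed data set size: 540 training + 60 testing images per fold = 600 images.
folds = list(k_fold_indices(600, 10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))  # 10 540 60

# Leave-one-out cross-validation is the special case k == n_samples.
loo_folds = list(k_fold_indices(5, 5))
print([len(test) for _, test in loo_folds])  # [1, 1, 1, 1, 1]
```

In such a scheme a classifier would be trained on each `train_idx` subset and scored on the matching `test_idx` subset, and the reported figure would be the mean of the per-fold accuracies.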


ABSTRAK

A study was conducted to develop an automatic phylogenetic tree image classification system using a machine learning algorithm. The study used a supervised machine learning algorithm, namely the Support Vector Machine (SVM). Image data were collected from the online databases PubMed, ScienceDirect and Bioinformatics. A performance comparison of three different types of phylogenetic tree features is also presented in this project. The aim is to determine suitable features for the phylogenetic tree image classification system. Leave-one-out cross-validation was used to calculate the accuracy of each feature. In addition, 10-fold cross-validation was measured in this study. The results show that the most suitable feature combination for the phylogenetic tree image classification system is SIFT, SURF and GIST. The accuracy obtained by combining the three features can exceed 82.19%. The results also show that the average accuracy obtained from 10-fold cross-validation is 81.50%. These results demonstrate the use of combined SIFT, SURF and GIST features to implement this phylogenetic tree classification system.

Keywords: image classification system, image processing, feature extraction, SIFT, GIST, SURF

CHAPTER ONE

INTRODUCTION

Overview

Phylogenetic trees are widely used for the evolutionary analysis of different species, organisms or genes descended from a common ancestor (Laubach, von Haeseler, & Lercher, 2012). According to Brinkman (2005), evolutionary analysis is a collection of methods, developed during the 1990s, for ascertaining long-term phenotypic evolution. Evolutionary analysis also rests on evolutionary theory, which is the foundation of most bioinformatic analysis. This is because evolutionary analysis shows the ecological characterization of a species using the concept of frequency dependence from gene theory (Brinkman, 2005). This chapter discusses the background of the study, problem statement, research objectives, research questions, hypothesis, conceptual framework and significance of the study. In addition, this chapter defines the relevant terms.

Introduction

The evolutionary tree, or phylogenetic tree, is a visualization that shows the relationships between entities according to the similarities and differences in their hereditary or physical characteristics (Baum, 2008). The way in which a phylogenetic tree shows the relationships among species is therefore important, as reflected in how phylogenetic trees are used to present the evolutionary analysis of any species. Evolutionary analysis generally includes the identification of analogous (homologous) sequences, diverse calibration (multiple sequence alignment), phylogenetic rebuilding (reconstruction), and the graphic representation or figure signification of the inferred tree (Dereeper et al., 2008). These four terms can be explained through biological evolution. According to Dereeper et al. (2008), the identification of analogous sequences finds similar sequences, whereas diverse calibration determines the differences in their alignment. Phylogenetic rebuilding is the process of building the phylogenetic tree after the analogous sequence and diverse calibration steps, and the graphic representation or figure signification shows the relationship between each species in the phylogenetic tree (Dereeper et al., 2008). This reflects the increasing use of phylogenetic trees in the biological sciences, especially by biologists who conduct evolutionary analyses of species. Therefore, the phylogenetic tree is quite important for the evolutionary analysis of life on Earth.

Apart from that, phylogeny is the evolutionary history of a species or group of related species (Pagel, 1999). Phylogeny can be regarded as part of systematics, the discipline that classifies organisms (Siegel-Causey, Brooks, & Funk, 1991), because systematists use phylogeny to determine the evolutionary relationships of organisms. According to Campbell and Reece (2008), the term systematist, as used in this research, refers to a professional who uses fossil, molecular and genetic data to infer evolutionary relationships. They also discuss PhyloCode, which can be used to depict a phylogenetic analysis as a branching phylogenetic tree. A phylogenetic analysis is presented as a collection of nodes and branches. For instance, taxa that are closely related in an evolutionary sense appear close to each other, whereas distantly related taxa lie on different branches of the tree, far from each other.

Background of the study

In 1859, Darwin presented the first illustration of a phylogenetic tree (Darwin, 1859). Before that, in 1837, shortly after his famous five-year voyage as naturalist on the Beagle, he sketched a tree diagram in his notebook (Darwin, 1859). As Figure 1 shows, this simple sketch was remarkably similar to modern diagrams of phylogenies (Darwin, 1987).

Figure 1. The first evolution tree diagram sketched by Darwin. Adapted from Charles Darwin's notebooks, 1836-1844: Geology, transmutation of species, metaphysical enquiries (p. 87), by Darwin, C., 1987, Cambridge: Cambridge University Press. Copyright 1987 by P. H. Barrett (Ed.). Adapted with permission.

Figure 2. Cumulative number of publications in the Science Citation Index since 1981 that cite the terms molecular and phylogeny in the keywords or abstract (horizontal axis: year, 1980 to 2000). Adapted from "Inferring the historical patterns of biological evolution," by Pagel, M., 1999, Nature, 401(6756), p. 844. Copyright 1999 by Pagel. Adapted with permission.

The first illustration of a phylogenetic tree accompanied the first scientific argument for the theory of evolution by means of natural selection. Darwin (1998) stated, "The time will come, I believe, though I shall not live to see it, when we shall have fairly true genealogical trees of each great kingdom of Nature" (p. 18). He would have been keen to see how modern genetics has supported and confirmed his own ideas. He provided evidence not only of what had happened in evolution but of precisely how living things evolve (Darwin, 1859). The forensic evidence for evolution available to biologists today is DNA.

Several approaches were used to investigate the evolution of species before molecular phylogenetics (Campbell & Reece, 2008). In the 1900s, immunochemical studies were used to detect cross-reactions, which are stronger between closely related organisms. Next, between the 1940s and the 1960s, biologists used protein sequencing, electrophoresis, DNA hybridization and PCR, which contributed to a boom in molecular phylogeny. On the other hand, after the publication of The Origin of Species by Darwin, many other biologists came to accept the idea of a universal Tree of Life (Darwin, 1987). Then, in the late 1970s, biologists started to investigate the evolution of organisms using molecular phylogeny. One expert who supported Darwin's Tree of Life was the German biologist Ernst Haeckel (Larget, 2011). Phylogenetic trees are very useful to biologists because they can be used to describe the relationships between living creatures, genomes and genes.

With the development of techniques for generating phylogenetic data, the number of studies depicting phylogenies has exploded (Pagel, 1999). The number of articles publishing phylogenies based on gene-sequence information has been increasing exponentially, as Figure 2 shows (Pagel, 1999). These phylogenies cover taxonomic groups ranging from viruses to bacteria, fungi, plants and animals (Campbell & Reece, 2008). Thus, the phylogenetic tree has become popular and important for the evolutionary analysis of organisms. The phylogenetic tree is a branching diagram that shows the evolutionary relationships of organisms (Baum, 2008). Based on Darwin (1859), evolution refers to a natural process that takes place in populations. It can be described as the change in the hereditary traits of a biological population over successive generations.

On the other hand, a phylogeny can show the similarities and differences in physical and hereditary traits. This is because taxa that are joined together in the tree are implied to share descent from a common node (Gregory, 2008). In this respect, a phylogenetic tree is similar to a family tree. Moreover, phylogenetic trees are constructed from the similarities or differences in physical or genetic features. In the past, scientists used only the traditional approach, which constructed phylogenetic trees from physical features alone. Fortunately, advances in technology have led to the accumulation of huge amounts of biological data (Wan & Che, 2013). This may change the way biological studies are conducted in various respects.

As mentioned by Wan and Che (2013), phylogenetic trees can be built using information on interacting pathways. They applied hierarchical clustering to two domains of organisms, eukaryotes and prokaryotes. Using interacting pathways can increase the effectiveness of revealing the evolutionary relationships of species (Wan & Che, 2013). A phylogenetic tree is constructed from a variety of evidence, most commonly by comparing DNA (Kaizhong, Jason T., & Dennis, 1996). It is an undirected, acyclic, connected graph. Basically, the lengths of the branches represent the time since the groups split from each other, and the internal nodes of the tree represent ancestors. The exterior nodes are called leaves.
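
The graph description above can be made concrete with a short Python sketch. It assumes the tree is stored as an undirected adjacency list; the function names `is_tree` and `leaves` and the taxon and ancestor labels are hypothetical and serve only as illustration. A graph is a tree when it is connected and has exactly one edge fewer than it has nodes, and its leaves are the nodes with a single neighbour.

```python
from collections import deque


def is_tree(adjacency):
    """Return True if the undirected graph is connected and acyclic."""
    nodes = list(adjacency)
    if not nodes:
        return False
    # Each undirected edge is listed twice, once from each endpoint.
    n_edges = sum(len(neighbours) for neighbours in adjacency.values()) // 2
    # Breadth-first search from an arbitrary node to test connectivity.
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    # A connected graph with n nodes is acyclic exactly when it has n - 1 edges.
    return len(seen) == len(nodes) and n_edges == len(nodes) - 1


def leaves(adjacency):
    """Return the exterior nodes (degree 1), sorted by name."""
    return sorted(node for node, nbrs in adjacency.items() if len(nbrs) == 1)


# Hypothetical example: four taxa joined through two ancestral nodes.
example_tree = {
    "human": ["anc1"],
    "chimp": ["anc1"],
    "anc1": ["human", "chimp", "anc2"],
    "gorilla": ["anc2"],
    "orangutan": ["anc2"],
    "anc2": ["anc1", "gorilla", "orangutan"],
}

print(is_tree(example_tree))  # True
print(leaves(example_tree))   # ['chimp', 'gorilla', 'human', 'orangutan']
```

Branch lengths are left out of this sketch; they could be attached to each edge if the time since divergence needs to be represented.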


Apart from constructing phylogenetic trees, a newer approach is to extract phylogenetic tree data from the published literature using content mining (Mounce, 2012). The term can be split into content and mining. Content can include anything, such as audio, video, metadata, text and images. Mining refers to extracting large amounts of information from that content. Extracting phylogenetic tree data from the literature relies on content mining rather than text mining alone because the content is more than just text (Mounce, 2012).

In short, phylogenetic trees provide a framework that shows the evolution of features (Baum, 2008). Related species share many similar features. Phylogenetic trees are also used in bio-prospecting, where an optimal strategy exploits phylogenetic information to target closely related species when searching for a shared feature of interest (Kelly, Grenyer, & Scotland, 2014). Phylogenetic trees are likewise useful in conservation evaluation for choosing sets of species that maximize the present utilitarian benefits of extant feature diversity as well as the range of future evolutionary trajectories.

Problem Statement of the study

With the increasing volume of publication databases, the number of published phylogenetic trees is growing. This is because, with the rapid accumulation of DNA sequence data, more and more phylogenetic trees are being constructed (Pagel, 1999). It is therefore challenging and time-consuming for a researcher to search for relevant information (Dereeper et al., 2008). In addition, published documents contain various types of content, such as images, audio, art and tables. Search engines rely on the text or caption associated with a figure to perform a search. As a result, a researcher has to classify phylogenetic tree images one by one, which is challenging and wastes time. Moreover, if searching for a particular phylogenetic tree is challenging and time-consuming for biologists, their research work may be delayed. Furthermore, phylogenetic trees were invented to study the evolution of organisms, and today published phylogenetic trees are mainly reused by other biologists. Therefore, an automated digitization application that searches for phylogenetic trees is truly needed, because it can replace this very challenging manual task by determining whether an image is a phylogenetic tree.

Therefore, the main purpose of this project is the automated classification of phylogenetic tree images using a machine learning algorithm. The classification focuses on deciding whether images in a PDF file or text file are phylogenetic trees or non-phylogenetic trees. Examples of phylogenetic trees are cladograms, phenograms and tree terminology. Examples of non-phylogenetic trees, on the other hand, are family trees, life cycles of organisms and flow charts. Figure 3 shows an example of a non-phylogenetic tree, a family tree (Murdoch, 2013).

Figure 3. Non-phylogenetic tree: family tree. Adapted from Murdoch Family Tree, by Murdoch, W., 2013. Retrieved from http://www.williammurdoch.net/articles_12_Murdoch_family_tree.html. Copyright 2013 by Murdoch. Adapted with permission.

General Objective: The main objective of this research study is to employ a machine learning algorithm that can classify images as phylogenetic trees or non-phylogenetic trees.

Specific Objective: The specific objectives of this study are:

i. To employ machine learning that can predict whether a phylogenetic tree is represented in an image.
ii. To compare and contrast the different features that represent a phylogenetic tree in an image.

Research Question

i. Can a neural network be used for the prediction of phylogenetic tree images?
ii. What discriminative features can be used for classifier learning?

Definition of Terms

i. Phylogenetic tree, also called a phylogeny, is a branching diagram that illustrates the lines of evolutionary relationship of different species, organisms or genes from a common ancestor (Baum, 2008).
ii. Phylogeny is the evolutionary relationship between organisms (Baum, 2008).
iii. Evolution analysis is the foundation of phylogenetic trees, with cautionary notes (Brinkman, 2005).
iv. Content mining is defined to include, as a significant part, figure mining, which targets non-textual content (Mounce, 2014).

This research study hopes to advance knowledge on the automated digitization of phylogenetic tree images by classifying images from PDF files or text files as phylogenetic trees or non-phylogenetic trees.

This research study focuses mainly on the rooted tree (cladogram) and the unrooted tree.

In conclusion, phylogenetics is the science of constructing hypotheses about the evolutionary relationships of organisms in the form of phylogenetic trees. This project is not concerned with the reconstruction of phylogenetic trees. Rather, it classifies images in PDF files or text files as phylogenetic trees or non-phylogenetic trees using a machine learning algorithm.

CHAPTER TWO

LITERATURE REVIEW

As mentioned by Mounce (2012), millions of papers about phylogenetic trees are now published each year, at an ever-growing rate. This is because the amount and diversity of species with at least partial sequence information is rapidly increasing. Thus, phylogenetic trees have become an integral part of various biological studies, given the exponential increase of sequence data generated by various classical and next-generation sequencing studies (Baum, 2008). This chapter is divided into a few sections. The first section focuses on phylogenetic trees and explains their meaning and purpose and the types of phylogenetic trees. The next section concentrates on features in images and emphasizes the features that are suitable for the image classification process. This section also reviews image recognition system frameworks.

Phylogenetic Tree

A phylogenetic tree, or evolution tree, is an illustrative representation of biological entities that are associated by common descent, such as species or higher-level taxonomic groups (Gregory, 2008). The phylogenetic tree is a backbone for various other biological studies (Baum, 2008). It is therefore a graphical representation of a hypothesis about the evolution of a species, with branches that separate, hybridize or terminate in extinction. Readers can read and understand the patterns of descent from phylogenetic trees. However, phylogenetic trees do not indicate when species evolved or how much genetic change occurred in a lineage (Kelly, Grenyer, & Scotland, 2014). Moreover, it should not be assumed from a phylogenetic tree that a taxon evolved from the taxon next to it.

Baum, Smith and Donovan (2005) stated that the phylogenetic tree is the most direct illustration of the principle of common ancestry, which is why it is crucial to evolutionary theory. They stress that a practical understanding of what a phylogenetic tree represents is important for understanding the evolutionary relationships of species. Phylogenetic trees have thus become important in the evolutionary analysis of any species, and biologists should increase their use of phylogenetic trees in the biological sciences. Furthermore, phylogenetic trees provide an efficient structure for organizing biodiversity information, and they help to develop an accurate conception of the totality of evolutionary history. It is therefore important for aspiring biologists to develop an understanding of phylogenetic trees.

Types of Phylogenetic Tree

Phylogenetic trees can be divided into different kinds. There are two main categories: rooted phylogenetic trees and unrooted phylogenetic trees. Apart from these two main categories, a phylogenetic tree can be drawn in several forms: the slanted cladogram (Figure 4; Phylogenetic tree, 2002), the rectangular cladogram (Figure 5; Phylogenetic tree, 2002) and the circular cladogram (Figure 6; Phylogenetic tree, 2002). Roots can be added artificially to unrooted trees by means of a species that unambiguously separated early from the other species being considered (Bacardit, 2009).
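
Rooting an unrooted tree with an early-diverging species (an outgroup) can be sketched in Python as follows. This is an illustrative sketch only: the function name `root_with_outgroup`, the node label `root` and the taxon labels are hypothetical, and the unrooted tree is assumed to be stored as an undirected adjacency list in which the outgroup is a leaf.

```python
def root_with_outgroup(adjacency, outgroup, root_name="root"):
    """Place a root on the branch leading to the outgroup leaf.

    Returns a dict mapping every node to the list of its children.
    """
    if len(adjacency[outgroup]) != 1:
        raise ValueError("the outgroup must be a leaf of the unrooted tree")
    neighbour = adjacency[outgroup][0]
    # The new root splits the outgroup from all remaining taxa.
    children = {root_name: [outgroup, neighbour], outgroup: []}
    # Walk away from the outgroup, turning undirected edges into parent-child links.
    stack = [(neighbour, outgroup)]
    while stack:
        node, parent = stack.pop()
        children[node] = [n for n in adjacency[node] if n != parent]
        stack.extend((child, node) for child in children[node])
    return children


# Hypothetical unrooted tree: taxa A, B, C and outgroup O, internal nodes x and y.
unrooted = {
    "A": ["x"],
    "B": ["x"],
    "C": ["y"],
    "O": ["y"],
    "x": ["A", "B", "y"],
    "y": ["x", "C", "O"],
}

rooted = root_with_outgroup(unrooted, "O")
print(rooted["root"])  # ['O', 'y']
print(rooted["y"])     # ['x', 'C']
print(rooted["x"])     # ['A', 'B']
```

Once the root is placed, every edge has a direction, so each internal node can be read as the common ancestor of the taxa below it.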

12

Page 3: Faculty of Cognitive Sciences and Human Development Tree Classification... · Figure 4: phylogenetic rooted-tree: rectangular cladogram ..... 13 Figure 5: phylogenetic rooted-tree:

I declare this ProjectThesis is classified as (Please tick (Jraquo

o CONFIDENTIAL (Contains confidential information under the Official Secret Act 1972)

o RESTRICTED (Contains restricted information as specified by the organisation where research was done)

~ OPEN ACCESS

I declare this ProjectThesis is to be submitted to the Centre for Academic Information Services (CAIS) and uploaded into UNIMAS Institutional Repository (UNlMAS IR) (Please tick 0raquo

~ YES

o NO

Validation of ProjectJThesis

I hereby duly affirmed with free consent and willingness declared that this said ProjectThesis shall be placed officially in the Centre for Academic Information Services with the abide interest and rights as follows

bull This ProjectThesis is the sole legal property of Universiti Malaysia Sarawak (UNlMAS)

bull The Centre for Academic Information Services has the lawful right to make copies of the ProjectThesis for academic and research purposes only and not for other purposes

bull The Centre for Academic Information Services has the lawful right to digitize the content to be uploaded into Local Content Database

bull The Centre for Academic Information Services has the lawful right to make copies of the ProjectThesis if required for use by other parties for academic purposes or by other Higher Learning Institutes

bull No dispute or any claim shall arise from the student himself herself neither a third party on this ProjectThesis once it becomes the sole property of UNlMAS

bull This ProjectThesis or any material data and information related to it shall not be distributed published or disclosed to any party by the student himselflherself without

Supervisors signature ___~(f=I---------f-___ Date 5 JUN~15

Current Address Universiti Malaysia Sarawak 94300 Kota Samarahan Sarawak

Notes If the ProjectThesis is CONFIDENTIAL or RESTRICTED please attach together as annexure a letter from the organisation with the date of restriction indicated and the reasons for the confidentiality and restriction

first obtaining a proval from UNlMAS t

Students signature -1----------shyDate

I

ttusat KJIIdlBlt Maldut Akad~mik UNIVERSlTI MALAYSIA SAltAWAK

PHYLOGENETIC TREE CLASSIFICATION SYSTEM BY USING MACHINE LEARNING ALGORITHM

TANJIAKAE

This project is submitted in partial fulfilment of the requirements for a

Bachelor of Science with Honours (Cognitive Science)

I

I

Faculty of Cognitive Sciences and Human Development UNIVERSITI MALAYSIA SARA W AK

(2015)

-

The project entitled Phylogenetic tree classification system by using machine learning algorithm was prepared by Tan Jia Kae and submitted to the Faculty of Cognitive Sciences and Human Development in partial fulfilment of the requirements for a Bachelor of Science with Honours (Cognitive Science)

Received for examination by

--------------------~--(Dr Lee Nung Kion)

Date 5 June 2015

Grade

II

ACKNOWLEDGEMENTS

First and foremost I would like to take this opportunity to express my deepest

appreciation to my supervisor Dr Lee Nung Kion for his generous and patient by spending his

precious time in order to give me a lot of remarks as well as sharing his superior knowledge

experience and expertise during the process in completing my Final Year Project Without his

guidance my project would not be completed successfully at the limited of time

Next I am deeply indebted to my family for affording their unceasing encouragement

support and attention effluence to me during the whole process of doing my Final Year Project

study especially for those periods that I really need some of their love to help me finish my Final

Year Project Thesis

In addition I would like to thank to all my friends and course mates who supported and

encouraged me in completion of this project During the completion of this project I faced some

ofdifficulties that would pull me to give up Luckily they are giving me full of advices and

support that give me the strength and confidence to finish my Final Year Project Thesis

III

Pusa unnit MwumalA Oil (-1 bullbull

UNlVEKSITI MALAYSIA SAItAWAK

TABLE OF CONTENTS

LIST OF TABLES v

LIST OF FIGURES vi

ABSTRACT viii

ABSTRAK ix

CHAPTER ONE INTRODUCTION 1

CHAPTER TWO LITERATURE REVIEW 11

CHAPTER THREE METHODOLOGY 39

CHAPTER FOUR RESULT AND DISCUSSION 62

CHAPTER FIVE CONCLUSION AND RECOMMENDATION 69

REFERENCE 73

APPENDIX A PHYLOGENETIC TREE CLASSIFICATION SYSTEM MATLAB CODING79

IV

LIST OF TABLES

Table I Phylogenetic Tree Classification Cross-validation results based on different features 62

Table 2 lO-fold cross-validation results with 540 training data and 60 testing data each fold 66

v

LIST OF FIGURES

Figure 1 The first evolution tree diagram sketched by Darwin 3 I

Figure 2 Cumulative number of publications in Science Citation index since 1981 that cite the term molecular and phylogeny in the keywords or abstract 3

Figure 3 Non-phylogenetic tree- family tree 8

Figure 4 phylogenetic rooted-tree rectangular cladogram 13

Figure 5 phylogenetic rooted-tree Slanted diagram 13

Figure 6 phylogenetic unrooted-tree circular cladogram 14

Figure 7 phylogenetic scaled-tree 16

Figure 8 phylogenetic unscaled-tree 16

Figure 9 A quick review of phylogenetic tree 19

Figure 10 Object detection in computer perception 25

Figure 11 Feature Representation 25

Figure 12 SIFT 27

Figure 13 RIFT - 27

Figure 14 Spin image 28

Figure 15 Pre-pocessing of model Objects 32

Figure 16 Recognition of object in the scene 33

Figure 17 TreeRipper 36

Figure 18 TreeSnatcher Plus 37

Figure 19 Windows Snipping Toolbox 44

Figure 20 Original lpng 46

Figure 21 After Thresholding 46

Figure 22 Grayscale image 46

Figure 23 SURF Feature Detection and Extraction 53

Figure 24 GIST Feature Detection and Extraction 54

Figure 25 10-fold cross-validation accuracy 63

Figure 26 Example of Graphic User Interface for the Phylogenetic Tree Image Classification system 67

Figure 27 Graphic User Interface for the Phylogenetic Tree Image Classification system 67


ABSTRACT

This study develops an automated phylogenetic tree image classification system using a machine learning algorithm. It adopts a supervised learning algorithm, the Support Vector Machine (SVM), for classification. Image data were collected from the online databases PubMed and ScienceDirect and from bioinformatics journals. The project compares the performance of three types of features for characterizing phylogenetic tree images, with the aim of determining which features suit the classification system. Leave-one-out cross-validation was used to calculate the accuracy of each feature, and 10-fold cross-validation was also conducted in the evaluation. Our results show that the most suitable feature combination for the phylogenetic tree image classification system is SIFT, SURF and GIST. The combination of these three features achieves an accuracy of just over 82%, and the average accuracy obtained from 10-fold cross-validation is 81.50%. These evaluation results demonstrate the utility of SIFT, SURF and GIST features for building a phylogenetic tree image classification system.

Keywords: phylogenetic tree, image classification system, image processing, feature extraction, SIFT, GIST, SURF

ABSTRAK

Sebuah kajian telah dijalankan untuk menghasilkan sistem pengelasan automatik imej pokok filogenetik dengan menggunakan algoritma mesin pembelajaran. Kajian tersebut telah menggunakan pembelajaran algoritma mesin diselia iaitu Mesin Vektor Sokongan (SVM). Data imej telah dikumpulkan dari pangkalan data dalam talian PUBMED, ScienceDirect dan Bioinformatik. Perbandingan antara prestasi tiga ciri-ciri pokok filogenetik yang berbeza juga telah ditunjukkan dalam projek ini. Tujuannya adalah untuk menentukan ciri-ciri yang sesuai untuk sistem klasifikasi pokok imej filogenetik. Satu pengesahan cuti keluar salib telah digunakan untuk mengira ketepatan bagi setiap ciri. Tambahan pula, 10 kali ganda silang pengesahan akan diukurkan dalam kajian ini. Hasil kajian ini telah menunjukkan bahawa ciri-ciri gabungan yang paling sesuai bagi imej sistem klasifikasi pokok filogenetik adalah SIFT, SURF dan GIST. Ketepatan yang diperolehi daripada tiga ciri-ciri melalui gabungan boleh memperolehi lebih daripada 82.19%. Selain itu, hasilnya juga menunjukkan ketepatan purata yang diperolehi daripada 10 kali ganda silang pengesahan iaitu sebanyak 81.50%. Hasil kajian ini menunjukkan gabungan ciri-ciri SIFT, SURF dan GIST untuk melaksanakan sistem filogenetik klasifikasi pokok ini.

Kata Kunci: sistem klasifikasi imej, pemprosesan imej, pengekstrakan ciri, SIFT, GIST, SURF

CHAPTER ONE

INTRODUCTION

Overview

Phylogenetic trees are widely used for the evolutionary analysis of different species, organisms or genes descended from a common ancestor (Laubach, von Haeseler, & Lercher, 2012). According to Brinkman (2005), evolutionary analysis is a collection of methods for ascertaining long-term phenotypic evolution that developed during the 1990s. Evolutionary analysis also underpins most bioinformatic analysis, because it rests on evolutionary theory: it shows the ecological characterization of species using the concept of frequency dependence from gene theory (Brinkman, 2005). This chapter discusses the background of the study, the problem statement, the research objectives, the research questions, the hypothesis and conceptual framework, and the significance of the study. It also defines the relevant terms.

Introduction

The evolutionary tree, or phylogenetic tree, is a visualization of the relationships among entities according to the similarities and differences in their hereditary or physical characteristics (Baum, 2008). How a phylogenetic tree depicts the relationships among species therefore matters, because it is through the tree that the evolutionary analysis of any species is presented. Evolutionary analysis generally includes the identification of homologous sequences, their multiple alignment, the phylogenetic reconstruction, and the graphical representation of the inferred tree (Dereeper et al., 2008). These four terms can be explained through evolutionary biology. According to Dereeper et al. (2008), the identification of homologous sequences finds similar sequences, whereas multiple alignment determines the differences between the aligned sequences. Phylogenetic reconstruction then builds the phylogenetic tree from the identified and aligned sequences, and the graphical representation shows the relationship between the species in the tree (Dereeper et al., 2008). This reflects the increasing use of phylogenetic trees in the biological sciences, especially among biologists who carry out evolutionary analysis of species. The phylogenetic tree is therefore important for the evolutionary analysis of life on Earth.

Phylogeny is the evolutionary history of a species or group of related species (Pagel, 1999). It belongs to systematics, the discipline that classifies organisms (Siegel-Causey, Brooks, & Funk, 1991), because systematists use phylogeny to determine the evolutionary relationships of organisms. According to Campbell and Reece (2008), a systematist is a professional who uses fossil, molecular and genetic data to infer evolutionary relationships. They also proposed PhyloCode, which can be used to depict the phylogenetic analysis in branching phylogenetic trees. A phylogenetic analysis is presented as a collection of nodes and branches. For instance, taxa that are closely related in an evolutionary sense appear close to each other, whereas taxa that are distantly related sit on different branches of the tree, far from each other.

Background of the study

In 1859, Darwin published the first illustration of a phylogenetic tree (Darwin, 1859). Before that, in 1837, shortly after his famous five-year voyage as naturalist on the Beagle, he sketched a tree diagram in his notebook (Darwin, 1859). As Figure 1 shows, the simple sketch was remarkably similar to modern diagrams of phylogenies (Darwin, 1987).

Figure 1 The first evolution tree diagram sketched by Darwin. Adapted from Charles Darwin's notebooks, 1836-1844: Geology, transmutation of species, metaphysical enquiries (p. 87), by C. Darwin, 1987, Cambridge: Cambridge University Press. Copyright 1987 by P. H. Barrett (Ed.). Adapted with permission.

Figure 2 Cumulative number of publications in the Science Citation Index since 1981 that cite the terms molecular and phylogeny in the keywords or abstract (horizontal axis: year, 1980 to 2000). Adapted from Inferring the historical patterns of biological evolution, by M. Pagel, 1999, Nature, 401(6756), p. 844. Copyright 1999 by Pagel. Adapted with permission.

The first illustration of a phylogenetic tree accompanied the first scientific argument for the theory of evolution by means of natural selection. Darwin (1998) stated, "The time will come, I believe, though I shall not live to see it, when we shall have fairly true genealogical trees of each great kingdom of Nature" (p. 18). He would have welcomed seeing how modern genetics has supported and confirmed his ideas. He provided evidence not only that evolution had happened but also of how living things evolve. Modern genetics, and DNA in particular, now supplies forensic evidence for evolution that Darwin (1859) could not have had.

Several approaches were used to study the evolution of species before molecular phylogenetics (Campbell & Reece, 2008). In the early 1900s, immunochemical studies were used to detect cross-reactions, which are stronger for closely related organisms. From the 1940s to the 1960s, biologists used protein sequencing and electrophoresis, and DNA hybridization and, later, PCR contributed to a boom in molecular phylogeny. After the publication of The Origin of Species by Darwin, many other biologists came to accept the idea of a universal Tree of Life (Darwin, 1987). One of them was the German biologist Ernst Haeckel, who supported Darwin's Tree of Life (Larget, 2011). Then, in the late 1970s, biologists started to study the evolution of organisms using molecular phylogeny. Phylogenetic trees are very useful to biologists because they describe the relations between living creatures, genomes and genes.

With the development of techniques for producing phylogenetic data, the number of studies depicting phylogenies has exploded (Pagel, 1999). The number of articles publishing phylogenies based on gene-sequence information has been increasing exponentially, as Figure 2 shows (Pagel, 1999). These phylogenies cover taxonomic groups ranging from viruses to bacteria, fungi, plants and animals (Campbell & Reece, 2008). The phylogenetic tree has thus become popular and important for the evolutionary analysis of organisms. A phylogenetic tree is a branching diagram that shows the evolutionary relationships of organisms (Baum, 2008). Following Darwin (1859), evolution is a natural process acting on populations: the change in the heritable traits of a biological population over successive generations.

Phylogeny can show the similarities and differences in physical and hereditary traits, because taxa that join together in the tree are implied to have descended from a common ancestor at the node (Gregory, 2008). In this respect a phylogenetic tree resembles a family tree. The construction of phylogenetic trees is based on the similarities or differences of physical or genetic features. In the past, scientists constructed phylogenetic trees in the traditional way, which focused only on physical features. Advances in technology have since led to the accumulation of huge amounts of biological data (Wan & Che, 2013), which is changing the way biological studies are conducted in various respects.

As mentioned by Wan and Che (2013), phylogenetic trees can be built using information on interacting pathways. They applied hierarchical clustering to two domains of organisms, eukaryotes and prokaryotes. Using interacting pathways can increase the effectiveness of revealing the evolutionary relationships of species (Wan & Che, 2013). Phylogenetic trees are constructed using a variety of evidence, most commonly by comparing DNA (Kaizhong, Jason T., & Dennis, 1996). A phylogenetic tree is an undirected, acyclic, connected graph. The lengths of its branches represent the time since the groups split from each other, the internal nodes of the tree represent ancestors, and the set of exterior nodes are called leaves.
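The graph description above can be checked mechanically. The following is a minimal Python sketch, not taken from the thesis or its MATLAB appendix. The function names `is_tree` and `leaves`, the taxon labels and the example edge list are all hypothetical, chosen only for illustration. It relies on the standard fact that a connected undirected graph with n nodes is acyclic exactly when it has n - 1 edges.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Map each node to the set of nodes it shares an edge with."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


def is_tree(edges):
    """Return True if the undirected edge list is one connected, acyclic graph."""
    adjacency = build_adjacency(edges)
    if not adjacency:
        return False
    # A connected graph on n nodes is acyclic exactly when it has n - 1 edges.
    if len(edges) != len(adjacency) - 1:
        return False
    # Breadth-first search from any node must reach every node.
    start = next(iter(adjacency))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(adjacency)


def leaves(edges):
    """Return the exterior nodes: those attached to exactly one branch."""
    adjacency = build_adjacency(edges)
    return sorted(node for node, nbrs in adjacency.items() if len(nbrs) == 1)


# Hypothetical unrooted tree: taxa A-D are leaves, x and y are ancestral nodes.
example_edges = [("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"), ("D", "y")]
tree_ok = is_tree(example_edges)
tree_leaves = leaves(example_edges)
with_cycle = is_tree(example_edges + [("A", "B")])
```

In this sketch the interior nodes `x` and `y` stand for inferred ancestors, and adding the extra edge between `A` and `B` creates a cycle, so the result is no longer a tree.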

Apart from constructing phylogenetic trees, a newer approach extracts phylogenetic tree data from the published literature using content mining (Mounce, 2012). The term can be explained in two parts. Content can include anything, such as audio, video, metadata, text and images. Mining refers to the large-scale extraction of information from that content. Extracting phylogenetic tree data from the literature relies on content mining rather than text mining alone, because the content is more than just text (Mounce, 2012).

In short, phylogenetic trees provide a framework that shows the evolution of features (Baum, 2008), since related species share many features. Phylogenetic trees are also used in bio-prospecting, where an optimal strategy exploits phylogenetic information to target closely related species when searching for a shared feature of interest (Kelly, Grenyer, & Scotland, 2014). Phylogenetic trees are likewise useful in conservation evaluation, for choosing sets of species that maximize the present utilitarian benefits of extant feature diversity as well as the range of evolutionary trajectories in the future.

Problem Statement of the study

As the volume of publication databases increases, the number of published phylogenetic trees grows with it, because the rapid accumulation of DNA sequence data means that more and more phylogenetic trees are being constructed (Pagel, 1999). This makes searching for relevant information challenging and time consuming for a researcher (Dereeper et al., 2008). Moreover, published documents contain various types of content, such as images, audio, art and tables. Search engines rely on the text or caption associated with a figure to perform a search, so researchers are left to classify phylogenetic tree images one by one, which is difficult and wastes time. If biologists struggle to find a particular phylogenetic tree, their research may be delayed. Furthermore, although phylogenetic trees were originally devised to study the evolution of organisms, published trees are now mainly sought by biologists for reuse. An automated digitization application that searches for phylogenetic trees is therefore truly needed, because it can take over this demanding manual task and determine whether an image is a phylogenetic tree.

The main purpose of this project is therefore to automate the classification of phylogenetic tree images using a machine learning algorithm. The classification focuses on deciding whether images in a PDF or text file are phylogenetic trees or non-phylogenetic trees. Examples of phylogenetic trees are the cladogram, the phenogram and tree terminology. Examples of non-phylogenetic trees are the family tree, the life cycle of an organism and the flow chart. Figure 3 shows an example of a non-phylogenetic tree, a family tree (Murdoch, 2013).

Figure 3 Non-phylogenetic tree: family tree. Adapted from Murdoch Family Tree, by W. Murdoch, 2013. Retrieved from http://www.williammurdoch.net/articles_12_Murdoch_family_tree.html. Copyright 2013 by Murdoch. Adapted with permission.

General Objective: The main objective of this research study is to employ a machine learning algorithm that can classify images as phylogenetic trees or non-phylogenetic trees.

Specific Objective: The specific objectives of this study are

i. To employ machine learning that can predict whether an image represents a phylogenetic tree.

ii. To compare and contrast the different features that represent a phylogenetic tree in an image.

Research Question

i. Can a neural network be used for the prediction of phylogenetic tree images?

ii. What discriminative features can be used for classifier learning?

Definition of Terms

i. Phylogenetic tree, also called a phylogeny, is a branching diagram that illustrates the lines of evolutionary relationship of different species, organisms or genes from a common ancestor (Baum, 2008).

ii. Phylogeny is the evolutionary relationship between organisms (Baum, 2008).

iii. Evolution analysis is the foundation of phylogenetic trees, to be interpreted with cautionary notes (Brinkman, 2005).

iv. Content mining is defined as mining in which figure mining, that is, the mining of non-textual content, forms a significant part (Mounce, 2014).

This research study hopes to advance knowledge on the automated classification of images from PDF or text files as phylogenetic trees or non-phylogenetic trees.

This research study is mainly focused on the rooted tree (cladogram) and the unrooted tree.

In conclusion, phylogenetics is the science of constructing hypotheses about the evolutionary relationships of organisms in the form of phylogenetic trees. This project is not concerned with the reconstruction of phylogenetic trees. Rather, it classifies images in PDF or text files as phylogenetic trees or non-phylogenetic trees using a machine learning algorithm.

CHAPTER TWO

LITERATURE REVIEW

As mentioned by Mounce (2012), millions of papers are now published each year, at an ever growing rate, about phylogenetic trees. This is because the amount and diversity of species with at least partial sequence information is rapidly increasing. Phylogenetic trees have thus become an integral part of various biological studies, given the exponential increase in sequence data generated by classical and next-generation sequencing studies (Baum, 2008). This chapter is divided into a few sections. The first section focuses on phylogenetic trees, explaining their meaning and purpose and the types of phylogenetic trees. The next section concentrates on the features of an image, emphasizing the features that are suitable for the image classification process. This section also reviews image recognition system frameworks.

Phylogenetic Tree

A phylogenetic tree, or evolutionary tree, is an illustrative representation of biological entities that are associated through common descent, such as species or higher-level taxonomic groups (Gregory, 2008). The phylogenetic tree is a backbone for various other biological studies (Baum, 2008). It is a graphical representation of a hypothesis about the evolution of a species, with branches that separated, hybridized or terminated by extinction. Readers can read and understand the patterns of descent from a phylogenetic tree. However, phylogenetic trees do not indicate when species evolved or how much genetic change occurred in a lineage (Kelly, Grenyer, & Scotland, 2014). Nor should a phylogenetic tree be read as implying that a taxon evolved from the taxon next to it.

Baum, Smith and Donovan (2005) stated that the phylogenetic tree is the most direct illustration of the principle of common ancestry, which is why it is crucial for evolutionary theory. They stress that a practical understanding of what a phylogenetic tree represents is essential for understanding the evolutionary relationships of species. Phylogenetic trees are therefore important in the evolutionary analysis of any species, and biologists should make greater use of them in the biological sciences. Phylogenetic trees also provide an efficient structure for organizing biodiversity information and help develop an accurate conception of the totality of evolutionary history. It is therefore important for aspiring biologists to develop an understanding of phylogenetic trees.

Types of Phylogenetic Tree

Phylogenetic trees can be divided into different kinds. There are two main categories, rooted and unrooted phylogenetic trees. Apart from these two main categories, a phylogenetic tree can be drawn in several forms: the slanted cladogram in Figure 4 (Phylogenetic tree, 2002), the rectangular cladogram in Figure 5 (Phylogenetic tree, 2002) and the circular cladogram in Figure 6 (Phylogenetic tree, 2002). A root can be added artificially to an unrooted tree by means of a species that unambiguously separated early from the other species being considered (Bacardit, 2009).
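Rooting an unrooted tree with an early-diverging species (commonly called an outgroup) can be sketched in a few lines of Python. This is an illustrative sketch only, not the method of the thesis or of Bacardit (2009). The function name `root_with_outgroup`, the `root` label and the example taxa are hypothetical. The sketch assumes the tree is given as an adjacency mapping of sets and that the outgroup is a leaf: it places a new root node on the branch leading to the outgroup, then orients every branch away from that root.

```python
from collections import deque


def root_with_outgroup(adjacency, outgroup, root_label="root"):
    """Return a parent-to-children map for the tree rooted on the outgroup's branch."""
    if root_label in adjacency:
        raise ValueError("root_label must not clash with an existing node")
    if len(adjacency[outgroup]) != 1:
        raise ValueError("outgroup must be a leaf of the unrooted tree")
    (neighbour,) = adjacency[outgroup]

    # Copy the graph, then split the outgroup's branch with a new root node.
    graph = {node: set(nbrs) for node, nbrs in adjacency.items()}
    graph[outgroup] = {root_label}
    graph[neighbour].discard(outgroup)
    graph[neighbour].add(root_label)
    graph[root_label] = {outgroup, neighbour}

    # Breadth-first traversal from the root orients each branch away from it.
    children = {}
    visited = {root_label}
    queue = deque([root_label])
    while queue:
        node = queue.popleft()
        children[node] = sorted(n for n in graph[node] if n not in visited)
        visited.update(children[node])
        queue.extend(children[node])
    return children


# Hypothetical unrooted tree: O is the outgroup, x and y are interior nodes.
unrooted = {
    "A": {"x"},
    "B": {"x"},
    "C": {"y"},
    "O": {"y"},
    "x": {"A", "B", "y"},
    "y": {"C", "O", "x"},
}
rooted = root_with_outgroup(unrooted, "O")
```

With this hypothetical input, the new root has two children, the outgroup `O` and the interior node `y`, so every other taxon descends from the side of the root opposite the outgroup.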


encouraged me in completion of this project During the completion of this project I faced some

ofdifficulties that would pull me to give up Luckily they are giving me full of advices and

support that give me the strength and confidence to finish my Final Year Project Thesis

III

Pusa unnit MwumalA Oil (-1 bullbull

UNlVEKSITI MALAYSIA SAItAWAK

TABLE OF CONTENTS

LIST OF TABLES v

LIST OF FIGURES vi

ABSTRACT viii

ABSTRAK ix

CHAPTER ONE INTRODUCTION 1

CHAPTER TWO LITERATURE REVIEW 11

CHAPTER THREE METHODOLOGY 39

CHAPTER FOUR RESULT AND DISCUSSION 62

CHAPTER FIVE CONCLUSION AND RECOMMENDATION 69

REFERENCE 73

APPENDIX A PHYLOGENETIC TREE CLASSIFICATION SYSTEM MATLAB CODING79

IV

LIST OF TABLES

Table I Phylogenetic Tree Classification Cross-validation results based on different features 62

Table 2 lO-fold cross-validation results with 540 training data and 60 testing data each fold 66

v

LIST OF FIGURES

Figure 1 The first evolution tree diagram sketched by Darwin 3 I

Figure 2 Cumulative number of publications in Science Citation index since 1981 that cite the term molecular and phylogeny in the keywords or abstract 3

Figure 3 Non-phylogenetic tree- family tree 8

Figure 4 phylogenetic rooted-tree rectangular cladogram 13

Figure 5 phylogenetic rooted-tree Slanted diagram 13

Figure 6 phylogenetic unrooted-tree circular cladogram 14

Figure 7 phylogenetic scaled-tree 16

Figure 8 phylogenetic unscaled-tree 16

Figure 9 A quick review of phylogenetic tree 19

Figure 10 Object detection in computer perception 25

Figure 11 Feature Representation 25

Figure 12 SIFT 27

Figure 13 RIFT - 27

Figure 14 Spin image 28

Figure 15 Pre-pocessing of model Objects 32

Figure 16 Recognition of object in the scene 33

Figure 17 TreeRipper 36

Figure 18 TreeSnatcher Plus 37

Figure 19 Windows Snipping Toolbox 44

Figure 20 Original lpng 46

Figure 21 After Thresholding 46

Figure 22 Grayscale image 46

Figure 23 SURF Feature Detection and Extraction 53

vi

LIST OF FIGURES

Figure 24 GIST Feature Detection and Extraction 54

Figure 25 lO-fold cross-validation accuracy 63

Figure 26 Example of Graphic User Interface for the Phylogenetic Tree Image Classification system 67

Figure 27 Graphic User Interface for the Phylogenetic Tree Image Classification system 67

vii

ABSTRACT

A study is conducted to develop an automated phylogenetic tree image classification system by

using machine learning algorithm This study adopted supervised machine learning algorithm

which is the Support Vector Machine (SVM) for classification Image data were collected from

online databases PUBMED ScienceDirect and Bioinfonnatic journals Perfonnance

comparisons of three types of features to characterize the phylogenetic tree images are presented

in this project The aim is to detennine the suitable features for the phylogenetic tree image

classification systeIlJ The leave-out one cross-validation was used to calculate the accuracy of

each feature In addition to that 10-fold cross-validation is also conducted in the evaluation Our

results show that the suitable combination features for the phylogenetic tree image classification

system are SIFT SURF and GIST The accuracy obtained from these combinations of the three

features can achieve just over 82 On the other hands the results show the average accuracy

obtained from the 10-fold cross-validation is 8150 Our evaluation results demonstrate the

utility of using SIFT SURF and GIST features for building phylogenetic tree image

classification system

Keywords phylogenetic tree image classification system image processing feature extraction

SIFT GIST SURF

VIII

ABSTRAK

Sebuah kajian telah dijalankan untuk meghasilkan sistem pengelasan automatik imej pokok

filogenetik dengan menggunakan algoritma mesin pembelajaran Kajian tersebut telah

menggunakan pembelajaran algoritma mesin diselia iaitu Mesin Vektor Sokongan (SVM) Data

imej telah dikumpulkan dari pangkalan data dalam talian PUBMED ScienceDirect dan

Bioinformatik Perbandingan antara prestasi tiga ciri-ciri pokokfilogenetik yang berbeza juga

telah ditunjukkan dalam projek ini Tujuannya adalah untuk menentukan ciri-ciri yang sesuai

untuk sistem klasifikasi pokok imej filogenetik Satu pengesahan cuti keluar salib telah

digunakan untuk mengira ketepatan bagi setiap ciri Tambahan pula 10 kali ganda silang

pengesahan akan diukurkan dalam kajian ini Hasil kajian ini telah menunjukkan bahawa cirishy

cjri gabungan yang paling sesuai bagi imej sistem klasifikasi pokokfilogenetik adalah SIFT

SURF dan GIST Ketepatan yang diperolehi daripada tiga ciri-ciri melalui gabungan boleh

memperolehi lebih daripada 8219 Selain itu hasilnya juga menunjukkan ketepatan purata

yang diperolehi daripada 10 kali ganda silang pengesahan iaitu sebanyak 8150 Hasil kajian

ini menunjukkan gabungan ciri ciri SIFT SURF dan GIST untuk melaksanakan sistem

filogenetik klasifikasi pokok ini

Kata Kunci sistem klasifikasi imej pemprosesan imej pengekstrakan ciri SIFT GIST SURF

IX

CHAPTER ONE

INTRODUCTION

Overview

It is an undeniable fact that the phylogenetic trees are diffusely used for evolutionary

analysis of different species organisms or genes from a collaborative ancestor (Laubach von

Haeseler amp Lercher 2012) According to the Brinkman (2005) evolution analysis is a collection

of expedients for ascertainment long-term phenotypical evolution which developed during the

year of 1990s Evolutionary analysis also refers to foundation of most bioinformatic analysis

which is evolution theory This is because the evolutionary analysis shows the ecological

characterization of the species that uses the concept of frequency dependence from gene theory

(Brinkman 2005) This chapter mainly discusses about the background of the study problem

statements research objectives research questions hypothesis and conceptual framework of the

study and significance of the study In addition this chapter also describes the definition of

relevant terms

Introduction

The evolutionary tree or phylogenetic tree is a visualization to show the relationship

between all entities according to the similarities and differences in their hereditary or physical

characteristics (Baum 2008) Therefore the way of phylogenetic tree shows the relationship

among the species was also important This can be reflectedby the way of phylogenetic tree to

demonstrate the evolution analysis of any species in this world Evolution analysis generally

iocludes the identification of analogous sequence diverse calibration phylogenetic rebuilding

and graphic representation or figure signification of the inferred tree (Dereeper et aI 2008)

Jbcse four terms can be explained through the biology evolutions According to Dereeper et ai

(2008) the analogous sequence is used to identify the similar sequence whereas the diverse

calibration is used to determine the difference of alignment Besides the phylogenetic rebuilding

is the process to build up the phylogenetic tree after the analogous seqence and the diverse

calibration process and then for the graphic representation or figure signification is used to show

the relationship between each species in the phylogenetic tree (Dereeper et aI 2008) This can

show that the increasing use of phylogenetic trees in biological sciences especially for biologists

who did the evolution analysis on the species Therefore the use of phylogentic tree is quite

important for the evolution analysis of life on Earth

Apart from that phylogeny is the evolutionary history of a species or group of related

species (Pagel 1999) The phylogeny can be called as the discipline of systematic classifies

organisms (Siegel-Causey Brooks amp Funk 1991) This is because phylogeny can be used to

determine organisms evolutionary relationship by systematist According to Campbell and Reece

(2008) the term systematist in this research refers to the professional who used fossil molecular

and genetic data to infer evolutionary relationships They also proposed PhyloCode which can

be used to depict the phylogenetic analysis in branching phylogenetic trees A phylogenetic

analysis presents as a collection of nodes and branch For instance the taxa that closely related

are in an evolutionary sense apppeared closely to each other whereas the taxa that distantly

related are in the different branches of the tree or there is a distance which is far from each other

in such tree

Background of the study

In 1859, Darwin published the first illustration of a phylogenetic tree (Darwin, 1859). Before that, in 1837, shortly after his famous five-year voyage as naturalist on the Beagle, he sketched a tree diagram in his notebook (Darwin, 1859). As Figure 1 shows, this simple sketch was remarkably similar to modern diagrams of phylogenies (Darwin, 1987).

Figure 1. The first evolution tree diagram sketched by Darwin. Adapted from Charles Darwin's notebooks, 1836-1844: Geology, transmutation of species, metaphysical enquiries (p. 87), by Darwin, C., 1987, Cambridge: Cambridge University Press. Copyright 1987 by P. H. Barrett (Ed.). Adapted with permission.

Figure 2. Cumulative number of publications in the Science Citation Index since 1981 that cite the terms "molecular" and "phylogeny" in the keywords or abstract (horizontal axis: year, 1980 to 2000). Adapted from "Inferring the historical patterns of biological evolution," by Pagel, M., 1999, Nature, 401(6756), p. 844. Copyright 1999 by Pagel. Adapted with permission.

The first illustration of a phylogenetic tree accompanied the first scientific argument for the theory of evolution by means of natural selection. Darwin (1998) stated, "The time will come, I believe, though I shall not live to see it, when we shall have fairly true genealogical trees of each great kingdom of Nature" (p. 18). He would have been glad to see how modern genetics has supported and confirmed his ideas. He provided evidence not only for what had happened in the course of evolution but also for precisely how living things evolve (Darwin, 1859). The forensic evidence for evolution available today is DNA, which was unknown in Darwin's time.

Several approaches were used to analyse the evolution of species before molecular phylogenetics (Campbell & Reece, 2008). In the 1900s, immunochemical studies showed that cross-reactions are stronger between closely related organisms. Between the 1940s and the 1960s, biologists used protein sequencing, electrophoresis, DNA hybridization, and PCR, which contributed to a boom in molecular phylogeny. Meanwhile, after the publication of The Origin of Species by Darwin, many other biologists accepted the idea of a universal Tree of Life (Darwin, 1987). Then, in the late 1970s, biologists started to analyse the evolution of organisms by using molecular phylogeny. One German biologist who supported Darwin's Tree of Life was Ernst Haeckel (Larget, 2011). Phylogenetic trees are very useful for biologists because they describe the relations between living creatures, genomes, and genes.

With the development of techniques for phylogenetic data, the number of studies depicting phylogenies has exploded (Pagel, 1999). The number of articles publishing phylogenies based on gene-sequence information has been increasing exponentially; Figure 2 shows this growth (Pagel, 1999). The taxonomic groups covered by these phylogenies range from viruses to bacteria, fungi, plants, and animals (Campbell & Reece, 2008). Thus the phylogenetic tree has become popular and important for the evolutionary analysis of organisms. The phylogenetic tree is a branching diagram that shows the evolutionary relationships of organisms (Baum, 2008). Following Darwin (1859), evolution refers to a natural process acting on populations. It can be described as the change in the heritable traits of biological populations over successive generations.

Phylogeny can also show the similarities and differences in physical and hereditary traits. This is because taxa that are joined together at a node are taken to be descended from that node (Gregory, 2008). In this respect a phylogenetic tree is similar to a family tree. Moreover, phylogenetic trees are constructed from the similarities or differences in the physical or genetic features of organisms. In the past, scientists used only the traditional approach, which constructed phylogenetic trees from physical features alone. Advances in technology have since led to the accumulation of huge amounts of biological data (Wan & Che, 2013), which is changing the way biological studies are carried out in various respects.

As mentioned by Wan and Che (2013), phylogenetic trees can be built from information on interacting pathways. They applied hierarchical clustering to two domains of organisms, eukaryotes and prokaryotes. Using interacting pathways can make it more effective to reveal the evolutionary relationships of species (Wan & Che, 2013). Phylogenetic trees are constructed from a variety of evidence, most commonly by comparing DNA (Kaizhong, Jason T., & Dennis, 1996). A phylogenetic tree is an undirected, acyclic, connected graph. The lengths of its branches represent the time since the groups split from each other, the interior nodes of the tree represent ancestors, and the exterior nodes are called leaves.
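The graph view described above can be illustrated with a short Python sketch. It is only an illustration and not part of the thesis's own system: the edge list, the taxon names, and the helper names `build_adjacency`, `is_tree`, and `leaves` are all hypothetical. The sketch checks that a list of branches forms a connected acyclic graph and lists the exterior nodes, taken here to be the nodes with exactly one neighbour.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Return an undirected adjacency map for a list of (u, v) branches."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


def is_tree(edges):
    """True if the branches form an undirected, connected, acyclic graph."""
    adjacency = build_adjacency(edges)
    if not adjacency:
        return False
    # A connected graph on n nodes is acyclic exactly when it has n - 1 edges.
    if len(edges) != len(adjacency) - 1:
        return False
    # Breadth-first search from any node must reach every other node.
    start = next(iter(adjacency))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(adjacency)


def leaves(edges):
    """Exterior nodes: nodes attached to exactly one branch."""
    adjacency = build_adjacency(edges)
    return sorted(node for node, nb in adjacency.items() if len(nb) == 1)


# Hypothetical toy tree: two interior (ancestor) nodes and four taxa.
EXAMPLE_EDGES = [
    ("ancestor_1", "human"),
    ("ancestor_1", "chimp"),
    ("ancestor_1", "ancestor_2"),
    ("ancestor_2", "mouse"),
    ("ancestor_2", "rat"),
]

if __name__ == "__main__":
    print(is_tree(EXAMPLE_EDGES))
    print(leaves(EXAMPLE_EDGES))
```

Adding a branch between two existing taxa would create a cycle, so `is_tree` would then reject the graph. Branch lengths are left out to keep the sketch small.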

Apart from constructing phylogenetic trees, a newer approach extracts phylogenetic tree data from the published literature by means of content mining (Mounce, 2012). The term can be explained in two parts. Content can be anything, such as audio, video, metadata, text, and images. Mining refers to the large-scale extraction of data and information from that content. Extracting phylogenetic tree data from the literature relies on content mining rather than text mining alone, because the content is more than just text (Mounce, 2012).

In short, phylogenetic trees provide a framework that shows the evolution of features (Baum, 2008): related species share many similar features. Phylogenetic trees are also used in bioprospecting, a strategy that exploits phylogenetic information to target closely related species when searching for a shared feature of interest (Kelly, Grenyer, & Scotland, 2014). Phylogenetic trees are likewise useful for conservation evaluation, in choosing sets of species that maximize the present utilitarian benefits of extant feature diversity as well as the range of evolutionary trajectories in the future.

Problem Statement of the study

As the volume of publication databases increases, the number of published phylogenetic trees is growing. With the rapid accumulation of DNA sequence data, more and more phylogenetic trees are being constructed (Pagel, 1999). This makes it challenging and time-consuming for a researcher to search for relevant information (Dereeper et al., 2008). In addition, the contents of these published documents are varied and include images, audio, art, and tables. Search engines rely on the text or captions associated with a figure to perform a search. Researchers are therefore left to classify phylogenetic tree images one by one, which is challenging and a waste of time, and the time a biologist spends searching for a particular phylogenetic tree may delay their research. Furthermore, phylogenetic trees were invented to study the evolution of organisms; nowadays, published phylogenetic trees are mainly sought by biologists for reuse. An automated digitization application that searches for phylogenetic trees on their behalf is therefore truly needed, because it can take over this laborious manual task by determining whether an image is a phylogenetic tree.

The main purpose of this project is therefore to automate the classification of phylogenetic tree images by using a machine learning algorithm. The classification determines whether images in a PDF or text file are phylogenetic trees or non-phylogenetic trees. Examples of phylogenetic trees are the cladogram, the phenogram, and tree terminology. Examples of non-phylogenetic trees, on the other hand, are the family tree, the life cycle of organisms, and the flow chart. Figure 3 shows a non-phylogenetic tree, a family tree (Murdoch, 2013).

Figure 3. Non-phylogenetic tree: family tree. Adapted from Murdoch Family Tree, by Murdoch, W., 2013. Retrieved from http://www.williammurdoch.net/articles_12_Murdoch_family_tree.html. Copyright 2013 by Murdoch. Adapted with permission.

General Objective

The main objective of this research study is to employ a machine learning algorithm that can classify images as phylogenetic trees or non-phylogenetic trees.

Specific Objectives

The specific objectives of this study are:

i. To employ machine learning that can predict whether an image represents a phylogenetic tree.

ii. To compare and contrast the different features that represent a phylogenetic tree in an image.

Research Questions

i. Can a neural network be used for prediction of phylogenetic tree images?

ii. What discriminative features can be used for classifier learning?

i. A phylogenetic tree, also called a phylogeny, is a branching diagram that illustrates the lines of evolutionary relationship among different species, organisms, or genes descended from a common ancestor (Baum, 2008).

ii. Phylogeny is the evolutionary relationship between organisms (Baum, 2008).

iii. Evolution analysis is the foundation of phylogenetic trees and is to be interpreted with caution (Brinkman, 2005).

iv. Content mining includes, as a significant part, figure mining, which deals with non-textual content (Mounce, 2014).

This research study hopes to advance knowledge on the automated digitization of images from PDF or text files, classifying them as phylogenetic trees or non-phylogenetic trees.

This research study is mainly focused on the rooted tree (cladogram) and the unrooted

In conclusion, phylogenetics is the science of constructing hypotheses about the evolutionary relationships of organisms in the form of phylogenetic trees. This project is not concerned with the reconstruction of phylogenetic trees. Rather, it classifies images in PDF or text files as phylogenetic trees or non-phylogenetic trees by using a machine learning algorithm.

CHAPTER TWO

LITERATURE REVIEW

As mentioned by Mounce (2012), millions of papers are now published each year, at an ever-growing rate, about phylogenetic trees. This is because the amount and diversity of species with at least partial sequence information is rapidly increasing. Phylogenetic trees have thus become an integral part of various biological studies, given the exponential increase in sequence data generated by classical and next-generation sequencing studies (Baum, 2008). This chapter is divided into a few sections. The first section focuses on phylogenetic trees and explains their meaning, their purpose, and their types. The next section concentrates on image features and emphasizes the features that are suitable for the image classification process. It also reviews image recognition system frameworks.

Phylogenetic Tree

A phylogenetic tree, or evolutionary tree, is an illustrative representation of biological entities that are connected by common descent, such as species or higher-level taxonomic groupings (Gregory, 2008). The phylogenetic tree is a backbone for various other biological studies (Baum, 2008). It is therefore a graphical representation of a hypothesis about the evolution of a species, with branches that separate, hybridize, or terminate in extinction. Readers can read and understand the patterns of descent from phylogenetic trees. However, phylogenetic trees do not indicate when species evolved or how much genetic change occurred in a lineage (Kelly, Grenyer, & Scotland, 2014). Nor should it be assumed from a phylogenetic tree that a taxon evolved from the taxon next to it.

Baum, Smith, and Donovan (2005) stated that the phylogenetic tree is the most direct illustration of the principle of common ancestry, which is why it is crucial for evolutionary theory. They stress that a practical understanding of what a phylogenetic tree represents is essential to understanding the evolutionary relationships of species. Phylogenetic trees have thus become important in the evolution analysis of any species, and biologists should make greater use of them in the biological sciences. Phylogenetic trees also provide an efficient structure for organizing biodiversity information and help develop an accurate conception of the totality of evolutionary history. It is therefore important for aspiring biologists to develop an understanding of phylogenetic trees.

Types of Phylogenetic Tree

Phylogenetic trees can be divided into different kinds. There are two main categories: rooted phylogenetic trees and unrooted phylogenetic trees. Apart from these two main categories, a phylogenetic tree can be drawn in several forms: the slanted cladogram in Figure 4 (Phylogenetic tree, 2002), the rectangular cladogram in Figure 5 (Phylogenetic tree, 2002), and the circular cladogram in Figure 6 (Phylogenetic tree, 2002). A root can be artificially added to an unrooted tree by means of a species that unambiguously separated early from the other species being considered (Bacardit, 2009).
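Rooting an unrooted tree with an early-diverging species, as described above, can be sketched in Python. This is a minimal illustration under stated assumptions, not the method of any tool cited in this thesis: the unrooted tree is assumed to be given as an adjacency map, the early-diverging species (the outgroup) is assumed to be a single leaf, and the function name `root_with_outgroup`, the label "root", and the toy taxa are hypothetical. The new root is placed on the branch that joins the outgroup to the rest of the tree.

```python
def root_with_outgroup(adjacency, outgroup, root_label="root"):
    """Root an unrooted tree on the branch leading to a leaf outgroup.

    adjacency maps each node to the set of its neighbours (undirected).
    Returns a dict mapping every node to a sorted list of its children.
    """
    if len(adjacency[outgroup]) != 1:
        raise ValueError("the outgroup must be a leaf of the unrooted tree")
    if root_label in adjacency:
        raise ValueError("root_label clashes with an existing node name")
    # Work on a copy so the caller's tree is left untouched.
    tree = {node: set(neighbours) for node, neighbours in adjacency.items()}
    (neighbour,) = tree[outgroup]
    # Split the outgroup branch and insert the new root node on it.
    tree[outgroup] = {root_label}
    tree[neighbour].discard(outgroup)
    tree[neighbour].add(root_label)
    tree[root_label] = {outgroup, neighbour}
    # Orient every branch away from the root with a depth-first walk.
    children = {}
    stack = [(root_label, None)]
    while stack:
        node, parent = stack.pop()
        children[node] = sorted(n for n in tree[node] if n != parent)
        stack.extend((child, node) for child in children[node])
    return children


# Hypothetical unrooted tree: taxa A, B, C and outgroup O, with two
# interior nodes x and y.
UNROOTED = {
    "A": {"x"},
    "B": {"x"},
    "C": {"y"},
    "O": {"y"},
    "x": {"A", "B", "y"},
    "y": {"C", "O", "x"},
}

ROOTED = root_with_outgroup(UNROOTED, "O")

if __name__ == "__main__":
    for node, kids in ROOTED.items():
        print(node, kids)
```

In the rooted result the outgroup hangs directly off the root, and every other taxon descends from the root's second child, which is how an early-separating species fixes the direction of all remaining branches.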


The project entitled "Phylogenetic tree classification system by using machine learning algorithm" was prepared by Tan Jia Kae and submitted to the Faculty of Cognitive Sciences and Human Development in partial fulfilment of the requirements for a Bachelor of Science with Honours (Cognitive Science).

Received for examination by:

(Dr Lee Nung Kion)

Date: 5 June 2015

Grade:

ACKNOWLEDGEMENTS

First and foremost, I would like to take this opportunity to express my deepest appreciation to my supervisor, Dr Lee Nung Kion, for his generosity and patience in spending his precious time to give me many remarks and to share his superior knowledge, experience, and expertise while I completed my Final Year Project. Without his guidance, my project would not have been completed successfully in the limited time available.

Next, I am deeply indebted to my family for their unceasing encouragement, support, and attention throughout my Final Year Project, especially during those periods when I really needed their love to help me finish my Final Year Project thesis.

In addition, I would like to thank all my friends and course mates who supported and encouraged me in the completion of this project. During the project I faced some difficulties that tempted me to give up. Luckily, they gave me plenty of advice and support, which gave me the strength and confidence to finish my Final Year Project thesis.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
ABSTRAK
CHAPTER ONE: INTRODUCTION
CHAPTER TWO: LITERATURE REVIEW
CHAPTER THREE: METHODOLOGY
CHAPTER FOUR: RESULT AND DISCUSSION
CHAPTER FIVE: CONCLUSION AND RECOMMENDATION
REFERENCE
APPENDIX A: PHYLOGENETIC TREE CLASSIFICATION SYSTEM MATLAB CODING

LIST OF TABLES

Table 1: Phylogenetic Tree Classification: cross-validation results based on different features
Table 2: 10-fold cross-validation results with 540 training data and 60 testing data each fold

LIST OF FIGURES

Figure 1: The first evolution tree diagram sketched by Darwin
Figure 2: Cumulative number of publications in Science Citation Index since 1981 that cite the terms molecular and phylogeny in the keywords or abstract
Figure 3: Non-phylogenetic tree: family tree
Figure 4: Phylogenetic rooted-tree: rectangular cladogram
Figure 5: Phylogenetic rooted-tree: slanted diagram
Figure 6: Phylogenetic unrooted-tree: circular cladogram
Figure 7: Phylogenetic scaled-tree
Figure 8: Phylogenetic unscaled-tree
Figure 9: A quick review of phylogenetic tree
Figure 10: Object detection in computer perception
Figure 11: Feature Representation
Figure 12: SIFT
Figure 13: RIFT
Figure 14: Spin image
Figure 15: Pre-processing of model objects
Figure 16: Recognition of object in the scene
Figure 17: TreeRipper
Figure 18: TreeSnatcher Plus
Figure 19: Windows Snipping Toolbox
Figure 20: Original lpng
Figure 21: After Thresholding
Figure 22: Grayscale image
Figure 23: SURF Feature Detection and Extraction
Figure 24: GIST Feature Detection and Extraction
Figure 25: 10-fold cross-validation accuracy
Figure 26: Example of Graphic User Interface for the Phylogenetic Tree Image Classification system
Figure 27: Graphic User Interface for the Phylogenetic Tree Image Classification system

ABSTRACT

A study was conducted to develop an automated phylogenetic tree image classification system using a machine learning algorithm. The study adopted a supervised machine learning algorithm, the Support Vector Machine (SVM), for classification. Image data were collected from the online databases PUBMED, ScienceDirect, and Bioinformatics journals. Performance comparisons of three types of features for characterizing phylogenetic tree images are presented in this project. The aim is to determine the features suitable for the phylogenetic tree image classification system. Leave-one-out cross-validation was used to calculate the accuracy of each feature. In addition, 10-fold cross-validation was conducted in the evaluation. Our results show that the suitable combination of features for the phylogenetic tree image classification system is SIFT, SURF, and GIST. The accuracy obtained from the combination of these three features is just over 82%. The average accuracy obtained from the 10-fold cross-validation is 81.50%. Our evaluation results demonstrate the utility of SIFT, SURF, and GIST features for building a phylogenetic tree image classification system.

Keywords: phylogenetic tree image classification system, image processing, feature extraction, SIFT, GIST, SURF

ABSTRAK

A study was conducted to produce an automated phylogenetic tree image classification system using a machine learning algorithm. The study used a supervised machine learning algorithm, namely the Support Vector Machine (SVM). Image data were collected from the online databases PUBMED, ScienceDirect, and Bioinformatics. A comparison of the performance of three different phylogenetic tree features is also presented in this project. The aim is to determine the features suitable for the phylogenetic tree image classification system. Leave-one-out cross-validation was used to calculate the accuracy of each feature. In addition, 10-fold cross-validation was carried out in this study. The results show that the most suitable combination of features for the phylogenetic tree image classification system is SIFT, SURF, and GIST. The accuracy obtained by combining the three features exceeds 82.19%. The results also show that the average accuracy obtained from 10-fold cross-validation is 81.50%. These results support combining the SIFT, SURF, and GIST features to implement this phylogenetic tree classification system.

Keywords: image classification system, image processing, feature extraction, SIFT, GIST, SURF

CHAPTER ONE

INTRODUCTION

Overview

It is an undeniable fact that phylogenetic trees are widely used for the evolutionary analysis of different species, organisms, or genes descended from a common ancestor (Laubach, von Haeseler, & Lercher, 2012). According to Brinkman (2005), evolutionary analysis is a collection of methods for determining long-term phenotypic evolution, developed during the 1990s. Evolutionary analysis also rests on evolutionary theory, which is the foundation of most bioinformatic analysis. This is because evolutionary analysis gives an ecological characterization of species that uses the concept of frequency dependence from gene theory (Brinkman, 2005). This chapter discusses the background of the study, the problem statement, the research objectives, the research questions, the hypothesis and conceptual framework of the study, and the significance of the study. It also defines the relevant terms.

Introduction

The evolutionary tree, or phylogenetic tree, is a visualization that shows the relationships between entities according to the similarities and differences in their hereditary or physical characteristics (Baum, 2008). The way in which a phylogenetic tree shows the relationships among species is therefore important, and it is reflected in how the tree presents the evolution analysis of any species. Evolution analysis generally includes the identification of analogous sequences, diverse calibration, phylogenetic rebuilding, and graphic representation of the inferred tree (Dereeper et al., 2008).


Mil5UMaf1ller

1865middot1869 1867 1906 1870middot1916 Wntdeath Chimist

1873 - 1912 1e oftiagtr01 the TI14R1C

~tn these ApI~ 191 2

Agnes Murdoch

1850-1944

1818-1891

William Murdoch 1856-1906

John Murdoch lS57 -1907

uptain Iltolaquoxr

I~ tilt sea 101 In rlie ~ Ap111906 Apr~ 1907

Margaret Elisabeth Murdoch 1882 -1973

teacher headmislress

Samuel Jr - CID shy ~artha Murdoch Patience Scott

1880middot1950 Merchant

1891 middot1976

Samuel Scott Murdoch

Grizzel Samuel Alexander Charles Donaldson Murdoch Murdoch Murdoch Murdoch

1822 -1877 1824 -1888 1827 - 1868 1829 - 1860 Shoemaker uDtain uptltn

OJowrerlln ~nt Nwy

HI~ cxItnl~ ~

Figure 1 Non-phylogenetic tree- family tree Adapted from Murdoch Family Tree by Murdoch W 2013 Retrieved from httpwwwwilliammurdochnetJarticles_12_Murdoch_family_treehtml

Copyright 2013 by the Murdoch Adapted with permission

8

General Objective The main objective of this research study is to employ a machine

learning algorithm that can classify images into phylogenetic tree or non-phylogenetic trees

Specific Objective The specific objectives of this study are

i To employ machine learning that can predict phylogenetic tree that represent in the

Image

II To compare and contrast the different features that represent phylogenetic tree on

image

Research Question

I Can neural network be used for prediction of phylogenetic tree images

II What are the discriminative features can be used for classifier learning

I Phylogenetic tree or called as a phylogeny is a branch diagram that can illustrate

the lines of evolutionary relationships of different kinds of species organism or

genes from a common ancestor (Baum D 2008)

II Phylogeny is the evolution relationship between organisms (Baum D 2008)

1II Evolution analysis is the fundamentals or foremost of phylogenetic trees with

cautionary notes (Brinkman 2005)

iv Content Mining is defined as a significant part of figure mining which is nonshy

textual content (Mounce 2014)

9

This research study hopes to advance knowledge on the automated digitization images of

phylogenetic trees from the pdf file or text file as phylogenetic tree or non-phylogenetic tree

This research study is mainly focused on the rooted tree (c1adogram) and the unrooted

In conclusion phylogenetic is the science of constructing hypothesis related to the

Iutionary relationship of organisms in the fonn of phylogenetic tree Then this project is not

laquomeemed with the reconstruction of phylogenetic trees Rather it was doing the classification of

phylogenetic trees image in pdf file or text file whether it is phylogenetic trees or nonshy

ylogenetic trees by using machine learning algorithm

10

CHAPTER TWO

LITERATURE REVIEW

As mentioned by Mounce (2012) recently there are millions of papers published each

at an ever growing rate about the phylogenetic tree This is because the amount and

mvllICImiddothI of species with at least partial sequence information was rapidly increasing Thus

phylogenetic trees become an integral part of various biological studies with the exponential

iDcrease of sequence data which is being generated by various classical and next generation

sequence studies (Baum D 2008) This chapter divides into few sections The first section

tbcuses on phylogenetic trees which explain more on the meaning and purpose for the

ylogenetic trees and types of phylogenetic trees The next section concentrates on the feature

mimage This section also emphasizes on the suitable features that were suitable used for image

ification process Besides this section reviewed on image recognition system frameworks as

nvaoSEeoletic Tree

Phylogenetic tree or evolution tree is an illustrative representation of biological entities

were associated with common descent such as species or higher-level taxonomic

___pmJ~ (Gregory 2008) Phylogenetic tree represents a backbone for various other biological (8aum 2008) Therefore it is a graphical representation of a hypothesis about the

_tlon of a species with branches that separated hybridized or terminated by extinction

readers can read and understand the patterns of descent from the phylogenetic trees

the phylogenetic trees do not indicate when species evolved or how much genetic

11

CD8ogeoccurred in a lineage (Kelly Grenyer amp Scotland 2014) This is because phylogenetic

should not be assumed that a taxon can be evolved from the taxon next to it

Baum Smitch and Donovan (2005) stated that phylogenetic tree is the most direct

itllltrgttilln of the principle of common ancestry This is because phylogenetic tree is very crucial

r evolutionary theory In fact they were trying to tell the readers that practical understanding

ofwhat phylogenetic tree represented is really important in understand the evolution relationship

( the species Thus the phylogenetic trees become important in the evolution analysis of any

species as the biologists should increase the use of phylogentic trees in biological sciences Next

ylogenetic trees provides an efficient structure for organizing biodiversity info Moreover it

elopes accurate conception of totality of evolutionary history Therefore it is important for

aspiring biologists to develop the understanding of phylogenetic trees

of Phylogenetic Tree

Phylogenetic trees can be divided into different kinds of trees There were two main

ories including the phylogenetic rooted trees and the phylogenetic unrooted trees Apart

the two main categories the phylogenetic tree can represent in several form slanted

iIIIiIIIWJrlm Figure 4 (Phylogenetic tree 2002) rectangular cladogram Figure 5 (Phylogenetic

2002) and circular cladogram Figure 6 (Phylogenetic tree 2002) Roots can be artificially

to unrooted trees by means of a species that had unambiguously separated early from

species being considered (Bacardit 2009)

12


ACKNOWLEDGEMENTS

First and foremost, I would like to take this opportunity to express my deepest appreciation to my supervisor, Dr Lee Nung Kion, for his generosity and patience in spending his precious time giving me many remarks, as well as sharing his knowledge, experience and expertise while I completed my Final Year Project. Without his guidance, my project would not have been completed successfully within the limited time.

Next, I am deeply indebted to my family for their unceasing encouragement, support and attention throughout my Final Year Project, especially during those periods when I really needed their love to help me finish my Final Year Project thesis.

In addition, I would like to thank all my friends and course mates who supported and encouraged me in completing this project. During the project I faced some difficulties that tempted me to give up. Luckily, they gave me plenty of advice and support, which gave me the strength and confidence to finish my Final Year Project thesis.


TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
ABSTRAK
CHAPTER ONE: INTRODUCTION
CHAPTER TWO: LITERATURE REVIEW
CHAPTER THREE: METHODOLOGY
CHAPTER FOUR: RESULT AND DISCUSSION
CHAPTER FIVE: CONCLUSION AND RECOMMENDATION
REFERENCE
APPENDIX A: PHYLOGENETIC TREE CLASSIFICATION SYSTEM MATLAB CODING

LIST OF TABLES

Table 1: Phylogenetic Tree Classification Cross-validation results based on different features
Table 2: 10-fold cross-validation results with 540 training data and 60 testing data each fold

LIST OF FIGURES

Figure 1: The first evolution tree diagram sketched by Darwin
Figure 2: Cumulative number of publications in Science Citation Index since 1981 that cite the terms molecular and phylogeny in the keywords or abstract
Figure 3: Non-phylogenetic tree: family tree
Figure 4: phylogenetic rooted-tree: rectangular cladogram
Figure 5: phylogenetic rooted-tree: slanted diagram
Figure 6: phylogenetic unrooted-tree: circular cladogram
Figure 7: phylogenetic scaled-tree
Figure 8: phylogenetic unscaled-tree
Figure 9: A quick review of phylogenetic tree
Figure 10: Object detection in computer perception
Figure 11: Feature Representation
Figure 12: SIFT
Figure 13: RIFT
Figure 14: Spin image
Figure 15: Pre-processing of model objects
Figure 16: Recognition of object in the scene
Figure 17: TreeRipper
Figure 18: TreeSnatcher Plus
Figure 19: Windows Snipping Toolbox
Figure 20: Original 1.png
Figure 21: After Thresholding
Figure 22: Grayscale image
Figure 23: SURF Feature Detection and Extraction
Figure 24: GIST Feature Detection and Extraction
Figure 25: 10-fold cross-validation accuracy
Figure 26: Example of Graphic User Interface for the Phylogenetic Tree Image Classification system
Figure 27: Graphic User Interface for the Phylogenetic Tree Image Classification system

ABSTRACT

A study was conducted to develop an automated phylogenetic tree image classification system using a machine learning algorithm. This study adopted a supervised machine learning algorithm, the Support Vector Machine (SVM), for classification. Image data were collected from the online databases PUBMED, ScienceDirect and Bioinformatics journals. Performance comparisons of three types of features used to characterize the phylogenetic tree images are presented in this project. The aim is to determine the suitable features for the phylogenetic tree image classification system. Leave-one-out cross-validation was used to calculate the accuracy of each feature. In addition, 10-fold cross-validation was conducted in the evaluation. Our results show that the suitable combination of features for the phylogenetic tree image classification system is SIFT, SURF and GIST. The accuracy obtained from the combination of these three features is just over 82%. The average accuracy obtained from the 10-fold cross-validation is 81.50%. Our evaluation results demonstrate the utility of SIFT, SURF and GIST features for building a phylogenetic tree image classification system.

Keywords: phylogenetic tree, image classification system, image processing, feature extraction, SIFT, GIST, SURF

ABSTRAK

A study was carried out to produce an automated classification system for phylogenetic tree images using a machine learning algorithm. The study used a supervised machine learning algorithm, the Support Vector Machine (SVM). Image data were collected from the online databases PUBMED, ScienceDirect and Bioinformatics. A comparison of the performance of three different phylogenetic tree features is also presented in this project. The aim is to determine the suitable features for the phylogenetic tree image classification system. Leave-one-out cross-validation was used to calculate the accuracy of each feature. In addition, 10-fold cross-validation was measured in this study. The results show that the most suitable combination of features for the phylogenetic tree image classification system is SIFT, SURF and GIST. The accuracy obtained by combining the three features is more than 82.19%. The results also show that the average accuracy obtained from 10-fold cross-validation is 81.50%. The results of this study show the usefulness of combining SIFT, SURF and GIST features to implement this phylogenetic tree classification system.

Keywords: image classification system, image processing, feature extraction, SIFT, GIST, SURF

CHAPTER ONE

INTRODUCTION

Overview

It is an undeniable fact that phylogenetic trees are widely used for the evolutionary analysis of different species, organisms or genes from a common ancestor (Laubach, von Haeseler, & Lercher, 2012). According to Brinkman (2005), evolution analysis is a collection of methods for determining long-term phenotypic evolution, developed during the 1990s. Evolutionary analysis also refers to the foundation of most bioinformatic analysis, which is evolutionary theory. This is because evolutionary analysis shows the ecological characterization of species using the concept of frequency dependence from gene theory (Brinkman, 2005). This chapter mainly discusses the background of the study, the problem statement, the research objectives, the research questions, the hypothesis and conceptual framework of the study, and the significance of the study. In addition, this chapter gives the definitions of relevant terms.

Introduction

The evolutionary tree, or phylogenetic tree, is a visualization that shows the relationships between entities according to the similarities and differences in their hereditary or physical characteristics (Baum, 2008). The way a phylogenetic tree shows the relationships among species is therefore important, as reflected in its use to demonstrate the evolution analysis of any species. Evolution analysis generally includes the identification of homologous sequences, their multiple alignment, phylogenetic reconstruction, and graphical representation of the inferred tree (Dereeper et al., 2008). These four terms can be explained through biological evolution. According to Dereeper et al. (2008), the identification of homologous sequences finds similar sequences, whereas multiple alignment determines the differences between the aligned sequences. Phylogenetic reconstruction is the process of building the phylogenetic tree once the homologous sequences have been identified and aligned, and the graphical representation shows the relationship between the species in the phylogenetic tree (Dereeper et al., 2008). This reflects the increasing use of phylogenetic trees in the biological sciences, especially by biologists who conduct evolution analysis on species. Therefore, the phylogenetic tree is important for the evolution analysis of life on Earth.

Apart from that, phylogeny is the evolutionary history of a species or group of related species (Pagel, 1999). Phylogeny is closely tied to systematics, the discipline that classifies organisms (Siegel-Causey, Brooks, & Funk, 1991), because systematists use phylogeny to determine the evolutionary relationships of organisms. According to Campbell and Reece (2008), the term systematist refers to a professional who uses fossil, molecular and genetic data to infer evolutionary relationships. They also discuss PhyloCode, which can be used to depict a phylogenetic analysis in branching phylogenetic trees. A phylogenetic analysis is presented as a collection of nodes and branches. For instance, taxa that are closely related in an evolutionary sense appear close to each other, whereas taxa that are distantly related sit on different branches of the tree, far from each other.

Background of the study

In 1859, Darwin presented the first illustration of a phylogenetic tree (Darwin, 1859). Before that, in 1837, shortly after his famous five-year voyage as a naturalist on the Beagle, he sketched a tree diagram in his notebook (Darwin, 1859). As Figure 1 shows, the simple sketch is remarkably similar to modern diagrams of phylogenies (Darwin, 1987).

Figure 1. The first evolution tree diagram sketched by Darwin. Adapted from Charles Darwin's notebooks, 1836-1844: Geology, transmutation of species, metaphysical enquiries (p. 87), by Darwin, C., 1987, Cambridge: Cambridge University Press. Copyright 1987 by P. H. Barrett (Ed.). Adapted with permission.

Figure 2. Cumulative number of publications in the Science Citation Index since 1981 that cite the terms molecular and phylogeny in the keywords or abstract (horizontal axis: Year). Adapted from Inferring the historical patterns of biological evolution, by Pagel, M., 1999, Nature, 401(6756), p. 844. Copyright 1999 by Pagel. Adapted with permission.

The first illustration of a phylogenetic tree accompanied the first scientific argument for the theory of evolution by means of natural selection. Darwin (1998) stated, "The time will come, I believe, though I shall not live to see it, when we shall have fairly true genealogical trees of each great kingdom of Nature" (p. 18). In other words, he would have welcomed seeing how modern genetics supports and confirms his own ideas. He provided evidence not only for what had happened in evolution but for precisely how living things evolve. The forensic evidence he used for evolution was the DNA (Darwin, 1859).

In fact, several approaches were used to study the evolution of species before molecular phylogenetics (Campbell & Reece, 2008). In the 1900s, immunochemical studies were used to detect cross-reactions, which are stronger for closely related organisms. Next, between the 1940s and the 1960s, biologists used protein sequencing, electrophoresis, DNA hybridization and PCR, which contributed to a boom in molecular phylogeny. On the other hand, after the publication of The Origin of Species by Darwin, many other biologists came to accept the idea of a universal Tree of Life (Darwin, 1987). Then, in the late 1970s, biologists started to study the evolution of organisms using molecular phylogeny. One German biologist who supported Darwin's Tree of Life was Ernst Haeckel (Larget, 2011). Phylogenetic trees are very useful to biologists because they can be used to describe the relationships between living creatures, genomes and genes.

With the development of techniques for phylogenetic data, the number of studies depicting phylogenies has exploded (Pagel, 1999). The number of articles publishing phylogenies based on gene-sequence information has been increasing exponentially, as Figure 2 shows (Pagel, 1999). These phylogenies cover taxonomic groups ranging from viruses to bacteria, fungi, plants and animals (Campbell & Reece, 2008). Thus, the phylogenetic tree has become popular and important for the evolutionary analysis of organisms. The phylogenetic tree is a branching diagram that shows the evolutionary relationships of organisms (Baum, D., 2008). Based on Darwin (1859), evolution refers to a natural process inferred about populations. It can be described as the change in the hereditary traits of biological populations over successive generations.

On the other hand, a phylogeny can show the similarities and differences in physical and hereditary traits. This is because taxa that join at a node are thereby indicated to be descendants of that node (Gregory, 2008). In this sense, a phylogenetic tree is similar to a family tree. Moreover, the construction of phylogenetic trees is based on the similarities or differences in physical or genetic features. Some years ago, scientists used only the traditional approach, which relied on physical features to construct phylogenetic trees. The advancement of high technology has since led to the accumulation of huge amounts of biological data (Wan & Che, 2013). This may change the way biological studies are conducted in various aspects.

As mentioned by Wan and Che (2013), phylogenetic trees can be built using information on interacting pathways. They applied hierarchical clustering to two domains of organisms, eukaryotes and prokaryotes. Using interacting pathways can increase the effectiveness of revealing the evolutionary relationships of species (Wan & Che, 2013). A phylogenetic tree is constructed using a variety of evidence, most commonly by comparing DNA (Kaizhong, Jason T., & Dennis, 1996). It is an undirected, acyclic, connected graph. Basically, the lengths of the branches represent the time since the groups split from each other, and the nodes of the tree are known as ancestors. The exterior nodes are called leaves.
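The graph description above can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the adjacency-list representation, the function names `is_tree` and `leaves`, and the six-node example tree are all hypothetical and do not come from the thesis. It checks the two properties named in the text (connected and acyclic) and lists the exterior nodes.

```python
from collections import deque


def is_tree(adjacency):
    """Return True if the undirected graph is connected and acyclic."""
    nodes = list(adjacency)
    if not nodes:
        return False
    # Each undirected branch is stored twice, once per endpoint.
    edge_count = sum(len(nbrs) for nbrs in adjacency.values()) // 2
    # A connected graph is acyclic exactly when it has n - 1 edges.
    if edge_count != len(nodes) - 1:
        return False
    # Breadth-first search to confirm every node is reachable.
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(nodes)


def leaves(adjacency):
    """Exterior nodes: those attached to exactly one branch."""
    return sorted(node for node, nbrs in adjacency.items() if len(nbrs) == 1)


# Hypothetical unrooted tree: two ancestor nodes (x, y) and four taxa.
example_tree = {
    "x": ["human", "chimp", "y"],
    "y": ["x", "mouse", "rat"],
    "human": ["x"],
    "chimp": ["x"],
    "mouse": ["y"],
    "rat": ["y"],
}
```

Branch lengths are left out of this sketch; they could be stored as weights on each edge if the time since divergence also needs to be represented.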

Apart from constructing phylogenetic trees, a newer approach extracts phylogenetic tree data from the published literature. It uses content mining to extract the data from the literature (Mounce, 2012). The term can be explained in two parts, content and mining. Content can include anything, such as audio, video, metadata, text and images. Mining refers to the extraction of large amounts of information from that content. Extracting phylogenetic tree data from the literature relies on content mining rather than text mining alone, because the content is more than just text (Mounce, 2012).

In short, phylogenetic trees provide a framework that shows the evolution of features (Baum, D., 2008), which shows that related species share many similar features. Phylogenetic trees are also used in bio-prospecting, a strategy that exploits phylogenetic information to target closely related species in the search for a shared feature of interest (Kelly, Grenyer, & Scotland, 2014). Phylogenetic trees are likewise useful for conservation evaluation, in choosing sets of species that maximize the present utilitarian benefits of extant feature diversity as well as the range of evolutionary trajectories in the future.

Problem Statement of the study

With the increasing volume of publication databases, the number of published phylogenetic trees is growing. With the rapid accumulation of DNA sequence data, more and more phylogenetic trees are being constructed (Pagel, 1999). This makes it challenging and time-consuming for a researcher to search for relevant information (Dereeper et al., 2008). Next, published documents contain various types of content, such as images, audio, arts and tables. Search engines rely on the text or captions associated with a figure to perform a search, so a researcher must classify phylogenetic tree images one by one, which is challenging and a waste of time. If biologists find searching for a particular phylogenetic tree challenging and time-consuming, their research work may be delayed. Furthermore, phylogenetic trees were devised to study the evolution of organisms, and today published phylogenetic trees are mainly reused by other biologists. Therefore, an automated digitization application that searches for phylogenetic trees is truly needed, because it can take over this very challenging manual task and determine whether an image is a phylogenetic tree.

Therefore, the main purpose of this project is to automate the classification of phylogenetic tree images using a machine learning algorithm. The classification focuses on deciding whether images in a PDF file or text file are phylogenetic trees or non-phylogenetic trees. Examples of phylogenetic trees are the cladogram, phenogram and tree terminology. On the other hand, examples of non-phylogenetic trees are the family tree, the life cycle of an organism and the flow chart. Figure 3 shows an example of a non-phylogenetic tree, a family tree (Murdoch, 2013).

Figure 3. Non-phylogenetic tree: family tree. Adapted from Murdoch Family Tree, by Murdoch, W., 2013. Retrieved from http://www.williammurdoch.net/articles_12_Murdoch_family_tree.html
Copyright 2013 by Murdoch. Adapted with permission.

General Objective: The main objective of this research study is to employ a machine learning algorithm that can classify images as phylogenetic trees or non-phylogenetic trees.

Specific Objective: The specific objectives of this study are:

i. To employ machine learning to predict whether an image represents a phylogenetic tree.
ii. To compare and contrast the different features that represent a phylogenetic tree in an image.

Research Question

i. Can a neural network be used to predict phylogenetic tree images?
ii. What discriminative features can be used for classifier learning?

Definition of Terms

i. Phylogenetic tree, also called a phylogeny, is a branching diagram that illustrates the lines of evolutionary relationship of different kinds of species, organisms or genes from a common ancestor (Baum, D., 2008).
ii. Phylogeny is the evolutionary relationship between organisms (Baum, D., 2008).
iii. Evolution analysis is the foundation of phylogenetic trees, with cautionary notes (Brinkman, 2005).
iv. Content mining is defined as a significant part of figure mining, which deals with non-textual content (Mounce, 2014).

This research study hopes to advance knowledge on the automated digitization of images from PDF files or text files by classifying them as phylogenetic trees or non-phylogenetic trees.

This research study mainly focuses on the rooted tree (cladogram) and the unrooted tree.

In conclusion, phylogenetics is the science of constructing hypotheses about the evolutionary relationships of organisms in the form of phylogenetic trees. This project is not concerned with the reconstruction of phylogenetic trees. Rather, it classifies images in PDF files or text files as phylogenetic trees or non-phylogenetic trees using a machine learning algorithm.
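The evaluation scheme named in the abstract, k-fold cross-validation, can be sketched in a few lines. This is a minimal illustration and not the thesis's MATLAB implementation: the function names `k_fold_indices` and `cross_validate` and the `evaluate_fold` callback are hypothetical, and the total of 600 images is an inference from the title of Table 2 (540 training and 60 testing images per fold). Leave-one-out cross-validation is the special case where k equals the number of samples.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_validate(n_samples, k, evaluate_fold):
    """Average the accuracy returned by evaluate_fold over all k folds.

    evaluate_fold(train, test) is a caller-supplied function (hypothetical
    here) that trains a classifier on `train` and scores it on `test`.
    """
    scores = [evaluate_fold(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# With an assumed 600 images and k = 10, every fold holds out 60 images.
for train, test in k_fold_indices(600, 10):
    assert len(train) == 540 and len(test) == 60
```

The folds here are contiguous for simplicity; in practice the sample order would normally be shuffled first so that each fold contains both classes.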

CHAPTER TWO

LITERATURE REVIEW

As mentioned by Mounce (2012), millions of papers are now published each year, at an ever-growing rate, about the phylogenetic tree. This is because the number of species with at least partial sequence information is rapidly increasing. Thus, phylogenetic trees have become an integral part of various biological studies, with the exponential increase of sequence data being generated by various classical and next-generation sequencing studies (Baum, D., 2008). This chapter is divided into a few sections. The first section focuses on phylogenetic trees and explains the meaning and purpose of phylogenetic trees and the types of phylogenetic trees. The next section concentrates on image features and emphasizes the features that are suitable for the image classification process. This section also reviews image recognition system frameworks.

Phylogenetic Tree

A phylogenetic tree, or evolutionary tree, is an illustrative representation of biological entities that are associated through common descent, such as species or higher-level taxonomic groupings (Gregory, 2008). The phylogenetic tree represents a backbone for various other biological studies (Baum, 2008). It is therefore a graphical representation of a hypothesis about the evolution of a species, with branches that separated, hybridized or terminated by extinction. Readers can read and understand the patterns of descent from phylogenetic trees. However, phylogenetic trees do not indicate when species evolved or how much genetic change occurred in a lineage (Kelly, Grenyer, & Scotland, 2014). In addition, it should not be assumed from a phylogenetic tree that a taxon evolved from the taxon next to it.

Baum, Smith, and Donovan (2005) stated that the phylogenetic tree is the most direct representation of the principle of common ancestry, which makes it crucial for evolutionary theory. They were telling readers that a practical understanding of what a phylogenetic tree represents is important for understanding the evolutionary relationships of species. Thus, phylogenetic trees have become important in the evolution analysis of any species, and biologists should increase their use of phylogenetic trees in the biological sciences. Next, phylogenetic trees provide an efficient structure for organizing biodiversity information. Moreover, they develop an accurate conception of the totality of evolutionary history. Therefore, it is important for aspiring biologists to develop an understanding of phylogenetic trees.

Types of Phylogenetic Tree

Phylogenetic trees can be divided into different kinds. There are two main categories, rooted trees and unrooted trees. Apart from these two main categories, a phylogenetic tree can be drawn in several forms: slanted cladogram, Figure 4 (Phylogenetic tree, 2002); rectangular cladogram, Figure 5 (Phylogenetic tree, 2002); and circular cladogram, Figure 6 (Phylogenetic tree, 2002). Roots can be artificially added to unrooted trees by means of a species that unambiguously separated early from the other species being considered (Bacardit, 2009).


TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
ABSTRAK
CHAPTER ONE INTRODUCTION
CHAPTER TWO LITERATURE REVIEW
CHAPTER THREE METHODOLOGY
CHAPTER FOUR RESULT AND DISCUSSION
CHAPTER FIVE CONCLUSION AND RECOMMENDATION
REFERENCE
APPENDIX A PHYLOGENETIC TREE CLASSIFICATION SYSTEM MATLAB CODING

LIST OF TABLES

Table 1. Phylogenetic tree classification cross-validation results based on different features
Table 2. 10-fold cross-validation results with 540 training data and 60 testing data in each fold

LIST OF FIGURES

Figure 1. The first evolution tree diagram sketched by Darwin
Figure 2. Cumulative number of publications in the Science Citation Index since 1981 that cite the terms "molecular" and "phylogeny" in the keywords or abstract
Figure 3. Non-phylogenetic tree: family tree
Figure 4. Phylogenetic rooted tree: rectangular cladogram
Figure 5. Phylogenetic rooted tree: slanted diagram
Figure 6. Phylogenetic unrooted tree: circular cladogram
Figure 7. Phylogenetic scaled tree
Figure 8. Phylogenetic unscaled tree
Figure 9. A quick review of phylogenetic tree
Figure 10. Object detection in computer perception
Figure 11. Feature representation
Figure 12. SIFT
Figure 13. RIFT
Figure 14. Spin image
Figure 15. Pre-processing of model objects
Figure 16. Recognition of object in the scene
Figure 17. TreeRipper
Figure 18. TreeSnatcher Plus
Figure 19. Windows Snipping Toolbox
Figure 20. Original 1.png
Figure 21. After thresholding
Figure 22. Grayscale image
Figure 23. SURF feature detection and extraction
Figure 24. GIST feature detection and extraction
Figure 25. 10-fold cross-validation accuracy
Figure 26. Example of Graphic User Interface for the Phylogenetic Tree Image Classification system
Figure 27. Graphic User Interface for the Phylogenetic Tree Image Classification system

ABSTRACT

A study was conducted to develop an automated phylogenetic tree image classification system using a machine learning algorithm. The study adopted a supervised machine learning algorithm, the Support Vector Machine (SVM), for classification. Image data were collected from the online databases PUBMED and ScienceDirect and from Bioinformatics journals. Performance comparisons of three types of features used to characterize phylogenetic tree images are presented in this project. The aim is to determine the suitable features for the phylogenetic tree image classification system. Leave-one-out cross-validation was used to calculate the accuracy of each feature. In addition, 10-fold cross-validation was conducted in the evaluation. Our results show that the suitable combination of features for the phylogenetic tree image classification system is SIFT, SURF and GIST. The accuracy obtained from the combination of these three features is just over 82%. The average accuracy obtained from the 10-fold cross-validation is 81.50%. Our evaluation results demonstrate the utility of SIFT, SURF and GIST features for building a phylogenetic tree image classification system.

Keywords: phylogenetic tree, image classification system, image processing, feature extraction, SIFT, GIST, SURF
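The two evaluation schemes named in the abstract can be illustrated with a minimal sketch. It is written in Python for illustration only (the thesis appendix uses MATLAB), and the function names `k_fold_indices` and `mean_accuracy` are hypothetical. The total of 600 images is an assumption inferred from the 540 training and 60 testing items per fold listed for Table 2. Leave-one-out cross-validation is the special case in which the number of folds equals the number of samples.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous folds over range(n_samples)."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample when k does not divide n_samples.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def mean_accuracy(fold_accuracies):
    """Average the per-fold accuracies into a single cross-validation score."""
    return sum(fold_accuracies) / len(fold_accuracies)


# Assumed dataset size: 600 images, giving 540 training and 60 testing items per fold.
ten_fold = list(k_fold_indices(600, 10))

# Leave-one-out on a toy set of 5 samples: every fold tests exactly one sample.
leave_one_out = list(k_fold_indices(5, 5))
```

In practice the sample order would normally be shuffled before splitting so that each fold contains a mix of both classes; the sketch omits this to stay deterministic.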

ABSTRAK

A study was conducted to produce an automated classification system for phylogenetic tree images using a machine learning algorithm. The study used a supervised machine learning algorithm, the Support Vector Machine (SVM). Image data were collected from the online databases PUBMED, ScienceDirect and Bioinformatics. A comparison of the performance of three different phylogenetic tree features is also presented in this project. The aim is to determine the suitable features for the phylogenetic tree image classification system. Leave-one-out cross-validation was used to calculate the accuracy of each feature. In addition, 10-fold cross-validation was measured in this study. The results show that the most suitable combination of features for the phylogenetic tree image classification system is SIFT, SURF and GIST. The accuracy obtained from the combination of the three features is more than 82.19%. The results also show that the average accuracy obtained from the 10-fold cross-validation is 81.50%. The results of this study demonstrate the use of the combined SIFT, SURF and GIST features to implement this phylogenetic tree classification system.

Keywords: image classification system, image processing, feature extraction, SIFT, GIST, SURF

CHAPTER ONE

INTRODUCTION

Overview

Phylogenetic trees are widely used for the evolutionary analysis of different species, organisms, or genes descended from a common ancestor (Laubach, von Haeseler, & Lercher, 2012). According to Brinkman (2005), evolutionary analysis is a collection of methods for studying long-term phenotypic evolution that was developed during the 1990s. Evolutionary analysis also rests on evolutionary theory, which is the foundation of most bioinformatic analysis. This is because evolutionary analysis shows the ecological characterization of a species using the concept of frequency dependence from gene theory (Brinkman, 2005). This chapter discusses the background of the study, the problem statement, the research objectives, the research questions, the hypothesis, the conceptual framework, and the significance of the study. In addition, this chapter defines the relevant terms.

Introduction

The evolutionary tree, or phylogenetic tree, is a visualization that shows the relationships between entities according to the similarities and differences in their hereditary or physical characteristics (Baum, 2008). The way a phylogenetic tree shows the relationships among species is therefore also important, as reflected in how phylogenetic trees are used to present the evolutionary analysis of any species. Evolutionary analysis generally includes the identification of homologous sequences, their multiple alignment, phylogenetic reconstruction, and the graphical representation of the inferred tree (Dereeper et al., 2008). These four steps can be explained in biological terms. According to Dereeper et al. (2008), the identification of homologous sequences finds similar sequences, whereas multiple alignment determines the differences between the aligned sequences. Phylogenetic reconstruction is the process of building the phylogenetic tree once the homologous sequences have been identified and aligned, and the graphical representation shows the relationships among the species in the tree (Dereeper et al., 2008). This reflects the increasing use of phylogenetic trees in the biological sciences, especially by biologists who carry out evolutionary analysis of species. The phylogenetic tree is therefore important for the evolutionary analysis of life on Earth.

Apart from that, phylogeny is the evolutionary history of a species or group of related species (Pagel, 1999). Phylogeny is closely tied to systematics, the discipline that classifies organisms (Siegel-Causey, Brooks, & Funk, 1991), because systematists use phylogeny to determine the evolutionary relationships of organisms. According to Campbell and Reece (2008), the term systematist refers to a professional who uses fossil, molecular, and genetic data to infer evolutionary relationships. They also discuss the proposed PhyloCode, which can be used to express the results of a phylogenetic analysis in branching phylogenetic trees. A phylogenetic analysis is presented as a collection of nodes and branches. For instance, taxa that are closely related in an evolutionary sense appear close to each other, whereas taxa that are distantly related lie on different branches of the tree, far from each other.

Background of the study

In 1859, Darwin published the first illustration of a phylogenetic tree (Darwin, 1859). Before that, in 1837, shortly after his famous five-year voyage as naturalist on the Beagle, he sketched a tree diagram in his notebook (Darwin, 1859). As Figure 1 shows, the simple sketch is remarkably similar to modern diagrams of phylogenies (Darwin, 1987).

Figure 1. The first evolution tree diagram sketched by Darwin. Adapted from Charles Darwin's notebooks, 1836-1844: Geology, transmutation of species, metaphysical enquiries (p. 87), by C. Darwin, 1987, Cambridge: Cambridge University Press. Copyright 1987 by P. H. Barrett (Ed.). Adapted with permission.

Figure 2. Cumulative number of publications in the Science Citation Index since 1981 that cite the terms "molecular" and "phylogeny" in the keywords or abstract (horizontal axis: year, 1980 to 2000). Adapted from "Inferring the historical patterns of biological evolution" by M. Pagel, 1999, Nature, 401(6756), p. 844. Copyright 1999 by Pagel. Adapted with permission.

The first illustration of a phylogenetic tree accompanied the first scientific argument for the theory of evolution by means of natural selection. Darwin (1998) stated, "The time will come, I believe, though I shall not live to see it, when we shall have fairly true genealogical trees of each great kingdom of Nature" (p. 18). He would have welcomed seeing how modern genetics supports and confirms his ideas. He provided evidence not only for what had happened in evolution but for precisely how living things evolve (Darwin, 1859). Today, DNA serves as the forensic evidence for evolution.

Several approaches were used to study the evolution of species before molecular phylogenetics (Campbell & Reece, 2008). In the 1900s, immunochemical studies showed that cross-reactions are stronger between closely related organisms. Between the 1940s and the 1960s, protein sequencing methods, electrophoresis, DNA hybridization, and PCR contributed to a boom in molecular phylogeny. Then, in the late 1970s, biologists began to study the evolution of organisms using molecular phylogeny. Meanwhile, after Darwin published The Origin of Species, many other biologists came to accept the idea of a universal Tree of Life (Darwin, 1987). One example is the German biologist Ernst Haeckel, who supported Darwin's Tree of Life (Larget, 2011). Phylogenetic trees are very useful to biologists because they can be used to describe the relationships between living creatures, genomes, and genes.

With the development of phylogenetic techniques, the number of studies depicting phylogenies has exploded (Pagel, 1999). The number of articles publishing phylogenies based on gene-sequence information has been increasing exponentially, as Figure 2 shows (Pagel, 1999). These phylogenies cover taxonomic groups ranging from viruses to bacteria, fungi, plants, and animals (Campbell & Reece, 2008). Thus the phylogenetic tree has become popular and important for the evolutionary analysis of organisms. The phylogenetic tree is a branching diagram that shows the evolutionary relationships of organisms (Baum, 2008). Based on Darwin (1859), evolution refers to a natural process acting on populations: the change in the heritable traits of biological populations over successive generations.

On the other hand, a phylogeny can show the similarities and differences in physical and hereditary traits. This is because taxa that join together at a node of the tree are indicated to be descended from a common ancestor at that node (Gregory, 2008). In this respect a phylogenetic tree is similar to a family tree. Moreover, phylogenetic trees are constructed from similarities or differences in physical or genetic features. In the past, scientists constructed phylogenetic trees in the traditional way, which relied only on physical features. Advances in technology have since led to the accumulation of huge amounts of biological data (Wan & Che, 2013), which may change the way biological studies are carried out in various respects.

As mentioned by Wan and Che (2013), phylogenetic trees can be built using information on interacting pathways. They applied hierarchical clustering to two domains of organisms, eukaryotes and prokaryotes. Using interacting pathways can increase effectiveness in revealing the evolutionary relationships of species (Wan & Che, 2013). Phylogenetic trees are constructed from a variety of evidence, most commonly by comparing DNA (Kaizhong, Jason T., & Dennis, 1996). A phylogenetic tree is an undirected, acyclic, connected graph. The lengths of the branches represent the time since the groups split from each other, and the internal nodes of the tree represent ancestors. The exterior nodes are called leaves.
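The graph view of a phylogenetic tree described above can be sketched in a few lines of Python. This is an illustrative sketch only: the taxon names and the internal node labels `x` and `y` are hypothetical, and the helper names (`build_adjacency`, `is_tree`, `leaves`, `path_length`) are not taken from the thesis. It checks that a set of branches forms an undirected, acyclic, connected graph, lists the leaves, and counts the branches separating two taxa.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (node, node) branches."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def _distances_from(adj, start):
    """Breadth-first search: number of branches from `start` to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def is_tree(adj):
    """A graph is a tree when it is connected and has exactly (nodes - 1) edges."""
    if not adj:
        return False
    n_edges = sum(len(neighbours) for neighbours in adj.values()) // 2
    connected = len(_distances_from(adj, next(iter(adj)))) == len(adj)
    return connected and n_edges == len(adj) - 1


def leaves(adj):
    """Exterior nodes (leaves) are the nodes attached to exactly one branch."""
    return sorted(node for node, neighbours in adj.items() if len(neighbours) == 1)


def path_length(adj, a, b):
    """Number of branches on the unique path between two nodes of a tree."""
    return _distances_from(adj, a)[b]


# Hypothetical four-taxon unrooted tree with internal (ancestor) nodes x and y.
edges = [("human", "x"), ("chimp", "x"), ("x", "y"), ("gorilla", "y"), ("orangutan", "y")]
tree = build_adjacency(edges)
```

Here `path_length(tree, "human", "chimp")` is smaller than `path_length(tree, "human", "gorilla")`, which mirrors the earlier statement that closely related taxa appear close to each other in the tree. The count ignores branch lengths, so it describes an unscaled tree.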

Apart from constructing phylogenetic trees, a newer approach is to extract phylogenetic tree data from the published literature using content mining (Mounce, 2012). The term can be explained in two parts. Content can include anything, such as audio, video, metadata, text, and images. Mining refers to the large-scale extraction of information from that content. Extracting phylogenetic tree data from the literature relies on content mining rather than text mining alone, because the content is more than just text (Mounce, 2012).

In short, phylogenetic trees provide a framework that shows the evolution of features (Baum, 2008): related species share many features in common. Phylogenetic trees are also used in bioprospecting, where an optimal strategy exploits phylogenetic information to target closely related species in the search for a shared feature of interest (Kelly, Grenyer, & Scotland, 2014). Phylogenetic trees are likewise useful in conservation evaluation for choosing sets of species that maximize the present utilitarian benefits of extant feature diversity as well as the range of evolutionary trajectories in the future.

Problem Statement of the study

As the volume of publication databases increases, the number of published phylogenetic trees grows as well, because with the rapid accumulation of DNA sequence data more and more phylogenetic trees are being constructed (Pagel, 1999). This makes it challenging and time consuming for a researcher to search for relevant information (Dereeper et al., 2008). The content of these published documents is also varied and includes images, audio, artwork, and tables. Search engines rely on the text or caption associated with a figure to perform a search, so researchers must classify phylogenetic tree images one by one, which is challenging and time consuming. If biologists spend a long time searching for a particular phylogenetic tree, their research may be delayed. Furthermore, phylogenetic trees were originally devised to study the evolution of organisms, whereas published phylogenetic trees are now mainly reused by other biologists. An automated digitization application that searches for phylogenetic trees is therefore needed, because it can replace this demanding manual task and determine whether an image is a phylogenetic tree.

Therefore, the main purpose of this project is the automated classification of phylogenetic tree images using a machine learning algorithm. The classification focuses on deciding whether images in a PDF file or text file are phylogenetic trees or non-phylogenetic trees. Examples of phylogenetic trees are cladograms, phenograms, and tree terminology. Examples of non-phylogenetic trees are family trees, life cycles of organisms, and flow charts. Figure 3 shows an example of a non-phylogenetic tree, a family tree (Murdoch, 2013).

Figure 3. Non-phylogenetic tree: family tree. Adapted from "Murdoch Family Tree" by W. Murdoch, 2013. Retrieved from http://www.williammurdoch.net/articles_12_Murdoch_family_tree.html. Copyright 2013 by Murdoch. Adapted with permission.

General Objective

The main objective of this research is to employ a machine learning algorithm that can classify images as phylogenetic trees or non-phylogenetic trees.

Specific Objective

The specific objectives of this study are:

i. To employ machine learning to predict whether an image represents a phylogenetic tree.
ii. To compare and contrast the different features that represent a phylogenetic tree in an image.

Research Question

i. Can a neural network be used for the prediction of phylogenetic tree images?
ii. What discriminative features can be used for classifier learning?
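The second objective, comparing different features, can be pictured with a small Python sketch. It enumerates every non-empty subset of the three feature types named in the abstract and joins the per-image feature vectors of a chosen subset into one vector. Joining by simple concatenation is an assumption made for illustration, since this section does not state how the features were combined, and the names `feature_subsets` and `concatenate` and the toy vectors are hypothetical.

```python
from itertools import combinations

# The three feature types named in the abstract.
FEATURES = ("SIFT", "SURF", "GIST")


def feature_subsets(names=FEATURES):
    """List every non-empty combination of feature types, smallest subsets first."""
    return [combo
            for size in range(1, len(names) + 1)
            for combo in combinations(names, size)]


def concatenate(per_feature_vectors, subset):
    """Join the vectors of the chosen feature types into one flat vector (assumed scheme)."""
    combined = []
    for name in subset:
        combined.extend(per_feature_vectors[name])
    return combined


# Toy vectors for a single image; real descriptors would be much longer.
image_features = {"SIFT": [0.1, 0.2], "SURF": [0.3], "GIST": [0.4, 0.5]}
all_three = concatenate(image_features, FEATURES)
```

Each subset returned by `feature_subsets()` corresponds to one candidate feature set whose cross-validated accuracy could be compared against the others.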

Definition of Terms

i. A phylogenetic tree, also called a phylogeny, is a branching diagram that illustrates the lines of evolutionary relationship among different species, organisms, or genes from a common ancestor (Baum, 2008).
ii. Phylogeny is the evolutionary relationship between organisms (Baum, 2008).
iii. Evolutionary analysis covers the fundamentals of phylogenetic trees, with cautionary notes (Brinkman, 2005).
iv. Content mining includes, as a significant part, figure mining, which targets non-textual content (Mounce, 2014).

This research hopes to advance knowledge on the automated classification of images from PDF files or text files as phylogenetic trees or non-phylogenetic trees.

This research is mainly focused on the rooted tree (cladogram) and the unrooted tree.

In conclusion, phylogenetics is the science of constructing hypotheses about the evolutionary relationships of organisms in the form of phylogenetic trees. This project is not concerned with the reconstruction of phylogenetic trees. Rather, it classifies images in PDF files or text files as phylogenetic trees or non-phylogenetic trees using a machine learning algorithm.

CHAPTER TWO

LITERATURE REVIEW

As mentioned by Mounce (2012), millions of papers about phylogenetic trees are now published each year, at an ever-growing rate. This is because the amount and [...] of species with at least partial sequence information is rapidly increasing. Thus phylogenetic trees have become an integral part of various biological studies, given the exponential increase in sequence data generated by classical and next-generation sequencing studies (Baum, 2008). This chapter is divided into a few sections. The first section focuses on phylogenetic trees, explaining the meaning and purpose of phylogenetic trees and the types of phylogenetic trees. The next section concentrates on image features and emphasizes the features suitable for the image classification process. This section also reviews image recognition system frameworks.

Phylogenetic Tree

A phylogenetic tree, or evolutionary tree, is an illustrative representation of biological entities that are connected by common descent, such as species or higher-level taxonomic groupings (Gregory, 2008). The phylogenetic tree forms a backbone for various other biological studies (Baum, 2008). It is therefore a graphical representation of a hypothesis about the evolution of a species, with branches that separate, hybridize, or terminate in extinction. Readers can read and understand the patterns of descent from phylogenetic trees. However, phylogenetic trees do not indicate when species evolved or how much genetic change occurred in a lineage (Kelly, Grenyer, & Scotland, 2014). In addition, it should not be assumed from a phylogenetic tree that a taxon evolved from the taxon next to it.

Baum, Smith, and Donovan (2005) stated that the phylogenetic tree is the most direct illustration of the principle of common ancestry, which is why phylogenetic trees are crucial for evolutionary theory. Their point is that a practical understanding of what a phylogenetic tree represents is important for understanding the evolutionary relationships of species. Phylogenetic trees are thus important in the evolutionary analysis of any species, and biologists should make greater use of them in the biological sciences. Phylogenetic trees also provide an efficient structure for organizing biodiversity information and develop an accurate conception of the totality of evolutionary history. It is therefore important for aspiring biologists to develop an understanding of phylogenetic trees.

Types of Phylogenetic Tree

Phylogenetic trees can be divided into different kinds. There are two main categories: rooted phylogenetic trees and unrooted phylogenetic trees. Apart from these two main categories, a phylogenetic tree can be drawn in several forms: the slanted diagram in Figure 4 (Phylogenetic tree, 2002), the rectangular cladogram in Figure 5 (Phylogenetic tree, 2002), and the circular cladogram in Figure 6 (Phylogenetic tree, 2002). A root can be added artificially to an unrooted tree by means of a species that unambiguously separated early from the species being considered (Bacardit, 2009).
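Adding a root by means of an early-diverging species, usually called an outgroup, can be sketched as follows. This Python sketch is illustrative only: the function name `root_with_outgroup`, the taxon names, and the internal node labels are hypothetical. It places a new root on the branch that joins the outgroup to the rest of the unrooted tree and returns, for every node, the list of its children.

```python
def root_with_outgroup(edges, outgroup, root_name="root"):
    """Root an unrooted tree on the branch leading to `outgroup`.

    `edges` is a list of (node, node) branches of an unrooted tree.
    Returns a dict mapping each node to the sorted list of its children.
    """
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    if root_name in adj:
        raise ValueError("root_name clashes with an existing node")
    if outgroup not in adj or len(adj[outgroup]) != 1:
        raise ValueError("outgroup must be a leaf of the tree")

    # The new root splits the outgroup's branch: one child is the outgroup,
    # the other is the node the outgroup was attached to.
    (neighbour,) = adj[outgroup]
    children = {root_name: [outgroup, neighbour], outgroup: []}

    # Walk away from the root; every neighbour except the parent becomes a child.
    stack = [(neighbour, outgroup)]
    while stack:
        node, parent = stack.pop()
        kids = sorted(n for n in adj[node] if n != parent)
        children[node] = kids
        stack.extend((kid, node) for kid in kids)
    return children


# Hypothetical unrooted tree; "orangutan" is assumed to have diverged earliest.
edges = [("human", "x"), ("chimp", "x"), ("x", "y"), ("gorilla", "y"), ("orangutan", "y")]
rooted = root_with_outgroup(edges, "orangutan")
```

In the result, the root has the outgroup on one side and all remaining taxa on the other, which is what gives the other branches a direction of descent.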

Page 8: Faculty of Cognitive Sciences and Human Development Tree Classification... · Figure 4: phylogenetic rooted-tree: rectangular cladogram ..... 13 Figure 5: phylogenetic rooted-tree:

LIST OF TABLES

Table I Phylogenetic Tree Classification Cross-validation results based on different features 62

Table 2 lO-fold cross-validation results with 540 training data and 60 testing data each fold 66

v

LIST OF FIGURES

Figure 1 The first evolution tree diagram sketched by Darwin 3 I

Figure 2 Cumulative number of publications in Science Citation index since 1981 that cite the term molecular and phylogeny in the keywords or abstract 3

Figure 3 Non-phylogenetic tree- family tree 8

Figure 4 phylogenetic rooted-tree rectangular cladogram 13

Figure 5 phylogenetic rooted-tree Slanted diagram 13

Figure 6 phylogenetic unrooted-tree circular cladogram 14

Figure 7 phylogenetic scaled-tree 16

Figure 8 phylogenetic unscaled-tree 16

Figure 9 A quick review of phylogenetic tree 19

Figure 10 Object detection in computer perception 25

Figure 11 Feature Representation 25

Figure 12 SIFT 27

Figure 13 RIFT - 27

Figure 14 Spin image 28

Figure 15 Pre-pocessing of model Objects 32

Figure 16 Recognition of object in the scene 33

Figure 17 TreeRipper 36

Figure 18 TreeSnatcher Plus 37

Figure 19 Windows Snipping Toolbox 44

Figure 20 Original lpng 46

Figure 21 After Thresholding 46

Figure 22 Grayscale image 46

Figure 23 SURF Feature Detection and Extraction 53

vi

LIST OF FIGURES

Figure 24 GIST Feature Detection and Extraction 54

Figure 25 lO-fold cross-validation accuracy 63

Figure 26 Example of Graphic User Interface for the Phylogenetic Tree Image Classification system 67

Figure 27 Graphic User Interface for the Phylogenetic Tree Image Classification system 67

vii

ABSTRACT

A study is conducted to develop an automated phylogenetic tree image classification system by

using machine learning algorithm This study adopted supervised machine learning algorithm

which is the Support Vector Machine (SVM) for classification Image data were collected from

online databases PUBMED ScienceDirect and Bioinfonnatic journals Perfonnance

comparisons of three types of features to characterize the phylogenetic tree images are presented

in this project The aim is to detennine the suitable features for the phylogenetic tree image

classification systeIlJ The leave-out one cross-validation was used to calculate the accuracy of

each feature In addition to that 10-fold cross-validation is also conducted in the evaluation Our

results show that the suitable combination features for the phylogenetic tree image classification

system are SIFT SURF and GIST The accuracy obtained from these combinations of the three

features can achieve just over 82 On the other hands the results show the average accuracy

obtained from the 10-fold cross-validation is 8150 Our evaluation results demonstrate the

utility of using SIFT SURF and GIST features for building phylogenetic tree image

classification system

Keywords phylogenetic tree image classification system image processing feature extraction

SIFT GIST SURF

VIII

ABSTRAK

Sebuah kajian telah dijalankan untuk meghasilkan sistem pengelasan automatik imej pokok

filogenetik dengan menggunakan algoritma mesin pembelajaran Kajian tersebut telah

menggunakan pembelajaran algoritma mesin diselia iaitu Mesin Vektor Sokongan (SVM) Data

imej telah dikumpulkan dari pangkalan data dalam talian PUBMED ScienceDirect dan

Bioinformatik Perbandingan antara prestasi tiga ciri-ciri pokokfilogenetik yang berbeza juga

telah ditunjukkan dalam projek ini Tujuannya adalah untuk menentukan ciri-ciri yang sesuai

untuk sistem klasifikasi pokok imej filogenetik Satu pengesahan cuti keluar salib telah

digunakan untuk mengira ketepatan bagi setiap ciri Tambahan pula 10 kali ganda silang

pengesahan akan diukurkan dalam kajian ini Hasil kajian ini telah menunjukkan bahawa cirishy

cjri gabungan yang paling sesuai bagi imej sistem klasifikasi pokokfilogenetik adalah SIFT

SURF dan GIST Ketepatan yang diperolehi daripada tiga ciri-ciri melalui gabungan boleh

memperolehi lebih daripada 8219 Selain itu hasilnya juga menunjukkan ketepatan purata

yang diperolehi daripada 10 kali ganda silang pengesahan iaitu sebanyak 8150 Hasil kajian

ini menunjukkan gabungan ciri ciri SIFT SURF dan GIST untuk melaksanakan sistem

filogenetik klasifikasi pokok ini

Kata Kunci sistem klasifikasi imej pemprosesan imej pengekstrakan ciri SIFT GIST SURF

IX

CHAPTER ONE

INTRODUCTION

Overview

It is an undeniable fact that the phylogenetic trees are diffusely used for evolutionary

analysis of different species organisms or genes from a collaborative ancestor (Laubach von

Haeseler amp Lercher 2012) According to the Brinkman (2005) evolution analysis is a collection

of expedients for ascertainment long-term phenotypical evolution which developed during the

year of 1990s Evolutionary analysis also refers to foundation of most bioinformatic analysis

which is evolution theory This is because the evolutionary analysis shows the ecological

characterization of the species that uses the concept of frequency dependence from gene theory

(Brinkman 2005) This chapter mainly discusses about the background of the study problem

statements research objectives research questions hypothesis and conceptual framework of the

study and significance of the study In addition this chapter also describes the definition of

relevant terms

Introduction

The evolutionary tree or phylogenetic tree is a visualization to show the relationship

between all entities according to the similarities and differences in their hereditary or physical

characteristics (Baum 2008) Therefore the way of phylogenetic tree shows the relationship

among the species was also important This can be reflectedby the way of phylogenetic tree to

demonstrate the evolution analysis of any species in this world Evolution analysis generally

iocludes the identification of analogous sequence diverse calibration phylogenetic rebuilding

and graphic representation or figure signification of the inferred tree (Dereeper et aI 2008)

Jbcse four terms can be explained through the biology evolutions According to Dereeper et ai

(2008) the analogous sequence is used to identify the similar sequence whereas the diverse

calibration is used to determine the difference of alignment Besides the phylogenetic rebuilding

is the process to build up the phylogenetic tree after the analogous seqence and the diverse

calibration process and then for the graphic representation or figure signification is used to show

the relationship between each species in the phylogenetic tree (Dereeper et aI 2008) This can

show that the increasing use of phylogenetic trees in biological sciences especially for biologists

who did the evolution analysis on the species Therefore the use of phylogentic tree is quite

important for the evolution analysis of life on Earth

Apart from that phylogeny is the evolutionary history of a species or group of related

species (Pagel 1999) The phylogeny can be called as the discipline of systematic classifies

organisms (Siegel-Causey Brooks amp Funk 1991) This is because phylogeny can be used to

determine organisms evolutionary relationship by systematist According to Campbell and Reece

(2008) the term systematist in this research refers to the professional who used fossil molecular

and genetic data to infer evolutionary relationships They also proposed PhyloCode which can

be used to depict the phylogenetic analysis in branching phylogenetic trees A phylogenetic

analysis presents as a collection of nodes and branch For instance the taxa that closely related

are in an evolutionary sense apppeared closely to each other whereas the taxa that distantly

related are in the different branches of the tree or there is a distance which is far from each other

in such tree

Background of the study

In 1859, Darwin produced the first illustration of a phylogenetic tree (Darwin, 1859). Before that, in 1837, shortly after his famous five-year voyage as naturalist on the Beagle, he sketched a tree diagram in his notebook (Darwin, 1859). As Figure 1 shows, the simple sketch is remarkably similar to modern diagrams of phylogenies (Darwin, 1987).

Figure 1. The first evolution tree diagram sketched by Darwin. Adapted from Charles Darwin's notebooks, 1836-1844: Geology, transmutation of species, metaphysical enquiries (p. 87), by C. Darwin, 1987, Cambridge: Cambridge University Press. Copyright 1987 by P. H. Barrett (Ed.). Adapted with permission.

Figure 2. Cumulative number of publications, by year, in the Science Citation Index since 1981 that cite the terms molecular and phylogeny in the keywords or abstract. Adapted from Inferring the historical patterns of biological evolution, by M. Pagel, 1999, Nature, 401(6756), p. 844. Copyright 1999 by Pagel. Adapted with permission.

The first illustration of a phylogenetic tree formed part of the first scientific argument for the theory of evolution by means of natural selection. Darwin (1998) stated, "The time will come, I believe, though I shall not live to see it, when we shall have fairly true genealogical trees of each great kingdom of Nature" (p. 18). He would have been glad to see how modern genetics has supported and confirmed his ideas. Darwin (1859) provided evidence not only for what had happened in evolution but for precisely how living things evolve, and DNA now supplies forensic evidence for evolution.

In fact, a few approaches were used to analyse the evolution of species before molecular phylogenetics (Campbell & Reece, 2008). In the 1900s, immunochemical studies were used to discover cross-reactions, which are stronger for closely related organisms. Next, between the 1940s and the 1960s, biologists used protein sequencing, electrophoresis, DNA hybridization, and PCR, which contributed to a boom in molecular phylogeny. Meanwhile, after the publication of The Origin of Species by Darwin, many other biologists came to accept the idea of a universal Tree of Life (Darwin, 1987). Then, in the late 1970s, biologists started to analyse the evolution of organisms using molecular phylogeny. One German biologist who supported Darwin's Tree of Life was Ernst Haeckel (Larget, 2011). Phylogenetic trees are very useful for biologists because they can be used to describe the relationships between living creatures, genomes, and genes.

With the development of techniques for obtaining phylogenetic data, the number of studies depicting phylogenies has exploded (Pagel, 1999). The number of articles publishing phylogenies based on gene-sequence information has been increasing exponentially, and Figure 2 shows the cumulative number of such publications (Pagel, 1999). These phylogenies cover taxonomic groups ranging from viruses to bacteria, fungi, plants, and animals (Campbell & Reece, 2008). The phylogenetic tree has thus become popular and important for the evolutionary analysis of organisms. A phylogenetic tree is a branching diagram that shows the evolutionary relationships of organisms (Baum, 2008). According to Darwin (1859), evolution is a natural process that acts on populations. It can be described as the change in the hereditary traits of biological populations over successive generations.

On the other hand, a phylogeny can show the similarities and differences in physical and hereditary traits. This is because taxa that are joined together in the tree are implied to have descended from a common node (Gregory, 2008). A phylogenetic tree can therefore be regarded as similar to a family tree. Moreover, phylogenetic trees are constructed from the similarities or differences in the physical or genetic features of organisms. Until a few years ago, scientists used only the traditional approach to constructing phylogenetic trees, which focused solely on physical features. Advances in technology have since led to the accumulation of huge amounts of biological data (Wan & Che, 2013), which may change the way biological studies are carried out in various respects.

As mentioned by Wan and Che (2013), phylogenetic trees can be built using information on interacting pathways. They applied hierarchical clustering to two domains of organisms, eukaryotes and prokaryotes. Using interacting pathways can make it more effective to reveal the evolutionary relationships of species (Wan & Che, 2013). Phylogenetic trees are constructed using a variety of evidence, most commonly by comparing DNA (Kaizhong, Jason T, & Dennis, 1996). A phylogenetic tree is an undirected, acyclic, connected graph. Basically, the lengths of the branches represent the time since the groups split from each other, and the nodes of the tree are known as ancestors. The exterior nodes are called leaves.
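The description of a phylogenetic tree as an undirected, acyclic, connected graph whose exterior nodes are leaves can be made concrete with a small Python sketch. It is an illustration under stated assumptions: the edge list, node names, branch lengths, and the helper names `build_adjacency`, `is_tree`, and `leaves` are hypothetical and do not come from the study.

```python
# Hypothetical unrooted tree given as undirected edges:
# (node, node, branch_length). Names and lengths are made up.
EDGES = [
    ("A", "x", 1.0),
    ("B", "x", 1.5),
    ("x", "y", 0.5),
    ("C", "y", 2.0),
    ("D", "y", 2.5),
]


def build_adjacency(edges):
    """Store each undirected edge in both directions with its branch length."""
    adjacency = {}
    for u, v, length in edges:
        adjacency.setdefault(u, {})[v] = length
        adjacency.setdefault(v, {})[u] = length
    return adjacency


def is_tree(edges):
    """An undirected graph is a tree if it is connected and has n - 1 edges."""
    adjacency = build_adjacency(edges)
    if len(edges) != len(adjacency) - 1:
        return False
    start = next(iter(adjacency))
    seen = {start}
    stack = [start]
    while stack:  # depth-first traversal to test connectivity
        node = stack.pop()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return len(seen) == len(adjacency)


def leaves(edges):
    """Exterior nodes (leaves) are the nodes with exactly one neighbour."""
    adjacency = build_adjacency(edges)
    return sorted(n for n, nbrs in adjacency.items() if len(nbrs) == 1)
```

Here the nodes with a single neighbour play the role of the leaves, and the remaining nodes correspond to the ancestors mentioned above.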

Apart from constructing phylogenetic trees, a newer approach extracts phylogenetic tree data from the published literature using content mining (Mounce, 2012). The term can be explained through its two parts, content and mining. Content can be anything, such as audio, video, metadata, text, and images. Mining refers to the large-scale extraction of data from that content. Extracting phylogenetic tree data from the literature relies on content mining rather than text mining because the content is more than just text (Mounce, 2012).

In short, phylogenetic trees provide a framework that shows the evolution of features (Baum, 2008), and related species share many features in common. Phylogenetic trees are also used in bioprospecting, where an optimal strategy exploits phylogenetic information to target closely related species when searching for a shared feature of interest (Kelly, Grenyer, & Scotland, 2014). In other words, shared features can be sought among related species. Phylogenetic trees are likewise useful in conservation evaluation, for choosing sets of species that maximize the present utilitarian benefits of extant feature diversity as well as the range of evolutionary trajectories in the future.

Problem Statement of the study

With the growing volume of publication databases, the number of published phylogenetic trees is getting larger, because the rapid accumulation of DNA sequence data means that more and more phylogenetic trees are being constructed (Pagel, 1999). This makes it challenging and time-consuming for a researcher to search for relevant information (Dereeper et al., 2008). The content of these published documents is also varied and includes images, audio, artwork, and tables. Search engines rely on the text or captions associated with a figure to perform a search, so a researcher has to classify phylogenetic tree images one by one, which is challenging and wastes time. If searching for a particular phylogenetic tree is difficult and time-consuming for biologists, their research may be delayed. Furthermore, phylogenetic trees are created to study the evolution of organisms, and published trees are now mainly reused by other biologists. An automated digitization application that searches for phylogenetic trees on their behalf is therefore genuinely needed, because it can replace this demanding manual task and determine whether an image is a phylogenetic tree.

Therefore, the main purpose of this project is to automate the classification of phylogenetic tree images using a machine learning algorithm. The classification focuses on deciding whether images in a PDF file or text file are phylogenetic trees or non-phylogenetic trees. Examples of phylogenetic trees are the cladogram, the phenogram, and tree terminology. On the other hand, examples of non-phylogenetic trees are the family tree, the life cycle of an organism, and the flow chart. Figure 3 shows an example of a non-phylogenetic tree, a family tree (Murdoch, 2013).

Figure 3. Non-phylogenetic tree: family tree. Adapted from Murdoch Family Tree, by W. Murdoch, 2013. Retrieved from httpwwwwilliammurdochnetJarticles_12_Murdoch_family_treehtml. Copyright 2013 by Murdoch. Adapted with permission.

General Objective

The main objective of this research study is to employ a machine learning algorithm that can classify images as phylogenetic trees or non-phylogenetic trees.

Specific Objective

The specific objectives of this study are:

i. To employ machine learning to predict whether an image represents a phylogenetic tree.

ii. To compare and contrast the different features that represent a phylogenetic tree in an image.
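Comparing feature types, as in the second objective, amounts to estimating a classification accuracy for each feature set with the same evaluation protocol. The abstract names leave-one-out and 10-fold cross-validation with an SVM. The Python sketch below shows only the cross-validation protocol. It is not the study's implementation: the nearest-centroid classifier is a simple stand-in for the SVM, and the function names, the `seed` parameter, and any feature vectors passed in are hypothetical.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def nearest_centroid_predict(train_x, train_y, sample):
    """Stand-in classifier (NOT the SVM of the study): closest class mean wins."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(sample, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def cross_val_accuracy(features, labels, k=10, seed=0):
    """Mean accuracy over k folds; k == len(labels) gives leave-one-out."""
    scores = []
    for test_idx in k_fold_indices(len(labels), k, seed):
        held_out = set(test_idx)
        train_x = [features[i] for i in range(len(labels)) if i not in held_out]
        train_y = [labels[i] for i in range(len(labels)) if i not in held_out]
        correct = sum(
            nearest_centroid_predict(train_x, train_y, features[i]) == labels[i]
            for i in test_idx
        )
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)
```

Calling `cross_val_accuracy` once per feature set, with the same labels and the same `k`, and comparing the returned means gives the kind of comparison the objective describes. Reusing the same folds for every feature set keeps the comparison fair.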

Research Question

i. Can a neural network be used to predict phylogenetic tree images?

ii. Which discriminative features can be used for classifier learning?

i. A phylogenetic tree, also called a phylogeny, is a branching diagram that illustrates the lines of evolutionary relationship among different species, organisms, or genes descended from a common ancestor (Baum, 2008).

ii. Phylogeny is the evolutionary relationship between organisms (Baum, 2008).

iii. Evolution analysis covers the fundamentals of phylogenetic trees, with cautionary notes (Brinkman, 2005).

iv. Content mining is defined here as a significant part of figure mining, that is, the mining of non-textual content (Mounce, 2014).

This research study hopes to advance knowledge on the automated digitization of phylogenetic tree images, in which images from a PDF file or text file are classified as phylogenetic trees or non-phylogenetic trees.

This research study is mainly focused on the rooted tree (cladogram) and the unrooted tree.

In conclusion, phylogenetics is the science of constructing hypotheses about the evolutionary relationships of organisms in the form of phylogenetic trees. This project is not concerned with the reconstruction of phylogenetic trees. Rather, it classifies images in a PDF file or text file as phylogenetic trees or non-phylogenetic trees using a machine learning algorithm.

CHAPTER TWO

LITERATURE REVIEW

As mentioned by Mounce (2012), millions of papers about phylogenetic trees are now published each year, at an ever-growing rate. This is because the number of species with at least partial sequence information is rapidly increasing. Phylogenetic trees have thus become an integral part of various biological studies, given the exponential increase in sequence data generated by classical and next-generation sequencing studies (Baum, 2008). This chapter is divided into a few sections. The first section focuses on phylogenetic trees and explains their meaning, purpose, and types. The next section concentrates on image features and emphasizes the features that are suitable for the image classification process. It also reviews image recognition system frameworks.

Phylogenetic Tree

A phylogenetic tree, or evolution tree, is an illustrative representation of biological entities that are connected through common descent, such as species or higher-level taxonomic groupings (Gregory, 2008). The phylogenetic tree is a backbone for various other biological studies (Baum, 2008). It is therefore a graphical representation of a hypothesis about the evolution of a species, with branches that have separated, hybridized, or been terminated by extinction. Readers can read and understand the patterns of descent from phylogenetic trees. Phylogenetic trees do not indicate when species evolved or how much genetic change occurred in a lineage (Kelly, Grenyer, & Scotland, 2014). Nor should it be assumed from a phylogenetic tree that a taxon evolved from the taxon next to it.

Baum, Smith, and Donovan (2005) stated that the phylogenetic tree is the most direct representation of the principle of common ancestry, which makes it crucial for evolutionary theory. Their point is that a practical understanding of what a phylogenetic tree represents is essential for understanding the evolutionary relationships of species. Phylogenetic trees are thus important in the evolutionary analysis of any species, and biologists should make greater use of them in the biological sciences. In addition, phylogenetic trees provide an efficient structure for organizing biodiversity information and help to develop an accurate conception of the totality of evolutionary history. It is therefore important for aspiring biologists to develop an understanding of phylogenetic trees.

Types of Phylogenetic Tree

Phylogenetic trees can be divided into different kinds of trees. There are two main categories, rooted phylogenetic trees and unrooted phylogenetic trees. Apart from these two main categories, a phylogenetic tree can be drawn in several forms: the slanted cladogram (Figure 4; Phylogenetic tree, 2002), the rectangular cladogram (Figure 5; Phylogenetic tree, 2002), and the circular cladogram (Figure 6; Phylogenetic tree, 2002). A root can be added artificially to an unrooted tree by means of a species that unambiguously separated early from the other species being considered (Bacardit, 2009).
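Adding a root to an unrooted tree by means of an early-diverging species can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a method taken from the study: the edge list, the taxon names, the placeholder node name `"root"`, and the function `root_with_outgroup` are all hypothetical, and branch lengths are ignored.

```python
def root_with_outgroup(edges, outgroup):
    """Return {child: parent} after placing a root on the outgroup's branch.

    `edges` is a list of undirected (node, node) pairs describing an
    unrooted tree. `outgroup` must be a leaf. The new node is named
    "root", which is assumed not to clash with an existing node name.
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    # A leaf has exactly one neighbour; unpacking fails otherwise.
    (attachment,) = adjacency[outgroup]

    # The root splits the outgroup's branch: the outgroup hangs on one
    # side and the rest of the tree on the other.
    parent = {outgroup: "root", attachment: "root"}
    stack = [attachment]
    while stack:
        node = stack.pop()
        for neighbour in adjacency[node]:
            if neighbour not in parent:
                parent[neighbour] = node
                stack.append(neighbour)
    return parent
```

For example, with edges `[("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"), ("O", "y")]` and outgroup `"O"`, the root is placed between `"O"` and `"y"`, and every other node receives a parent on the path leading back to that root.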


LIST OF FIGURES

Figure 1: The first evolution tree diagram sketched by Darwin
Figure 2: Cumulative number of publications in Science Citation Index since 1981 that cite the terms molecular and phylogeny in the keywords or abstract
Figure 3: Non-phylogenetic tree: family tree
Figure 4: phylogenetic rooted-tree: rectangular cladogram
Figure 5: phylogenetic rooted-tree: Slanted diagram
Figure 6: phylogenetic unrooted-tree: circular cladogram
Figure 7: phylogenetic scaled-tree
Figure 8: phylogenetic unscaled-tree
Figure 9: A quick review of phylogenetic tree
Figure 10: Object detection in computer perception
Figure 11: Feature Representation
Figure 12: SIFT
Figure 13: RIFT
Figure 14: Spin image
Figure 15: Pre-processing of model objects
Figure 16: Recognition of object in the scene
Figure 17: TreeRipper
Figure 18: TreeSnatcher Plus
Figure 19: Windows Snipping Toolbox
Figure 20: Original lpng
Figure 21: After Thresholding
Figure 22: Grayscale image
Figure 23: SURF Feature Detection and Extraction
Figure 24: GIST Feature Detection and Extraction
Figure 25: 10-fold cross-validation accuracy
Figure 26: Example of Graphic User Interface for the Phylogenetic Tree Image Classification system
Figure 27: Graphic User Interface for the Phylogenetic Tree Image Classification system

ABSTRACT

A study was conducted to develop an automated phylogenetic tree image classification system using a machine learning algorithm. The study adopted a supervised machine learning algorithm, the Support Vector Machine (SVM), for classification. Image data were collected from the online databases PUBMED and ScienceDirect and from bioinformatics journals. The performance of three types of features for characterizing phylogenetic tree images is compared in this project, with the aim of determining suitable features for the phylogenetic tree image classification system. Leave-one-out cross-validation was used to calculate the accuracy of each feature, and 10-fold cross-validation was also conducted in the evaluation. Our results show that the suitable feature combination for the phylogenetic tree image classification system is SIFT, SURF, and GIST. The accuracy obtained from the combination of these three features is just over 82%, and the average accuracy obtained from 10-fold cross-validation is 81.50%. Our evaluation results demonstrate the utility of SIFT, SURF, and GIST features for building a phylogenetic tree image classification system.

Keywords: phylogenetic tree image classification system, image processing, feature extraction, SIFT, GIST, SURF

ABSTRAK

A study was carried out to produce an automatic classification system for phylogenetic tree images using a machine learning algorithm. The study used a supervised machine learning algorithm, namely the Support Vector Machine (SVM). Image data were collected from the online databases PUBMED, ScienceDirect, and Bioinformatics. A comparison of the performance of three different phylogenetic tree features is also presented in this project. The aim is to determine suitable features for the phylogenetic tree image classification system. Leave-one-out cross-validation was used to calculate the accuracy of each feature. In addition, 10-fold cross-validation was measured in this study. The results show that the most suitable feature combination for the phylogenetic tree image classification system is SIFT, SURF, and GIST. The accuracy obtained by combining the three features exceeds 82.19%. The results also show that the average accuracy obtained from 10-fold cross-validation is 81.50%. The results of this study show that the combination of SIFT, SURF, and GIST features can be used to implement this phylogenetic tree classification system.

Keywords: image classification system, image processing, feature extraction, SIFT, GIST, SURF

CHAPTER ONE

INTRODUCTION

Overview

It is an undeniable fact that phylogenetic trees are widely used for the evolutionary analysis of different species, organisms, or genes descended from a common ancestor (Laubach, von Haeseler, & Lercher, 2012). According to Brinkman (2005), evolution analysis is a collection of techniques for studying long-term phenotypic evolution that was developed during the 1990s. Evolutionary analysis also refers to the foundation of most bioinformatic analysis, which is evolutionary theory. This is because evolutionary analysis shows the ecological characterization of species and uses the concept of frequency dependence from gene theory (Brinkman, 2005). This chapter mainly discusses the background of the study, the problem statement, the research objectives, the research questions, the hypothesis and conceptual framework of the study, and the significance of the study. In addition, this chapter gives the definitions of relevant terms.

Introduction

The evolutionary tree, or phylogenetic tree, is a visualization that shows the relationships between entities according to the similarities and differences in their hereditary or physical characteristics (Baum, 2008). The way in which a phylogenetic tree shows the relationships among species is therefore important, as reflected in how such trees are used to demonstrate the evolutionary analysis of any species in the world. Evolution analysis generally includes the identification of homologous sequences, their multiple alignment, phylogenetic reconstruction, and graphical representation of the inferred tree (Dereeper et al., 2008).


This research study hopes to advance knowledge on the automated digitization images of

phylogenetic trees from the pdf file or text file as phylogenetic tree or non-phylogenetic tree

This research study is mainly focused on the rooted tree (c1adogram) and the unrooted

In conclusion phylogenetic is the science of constructing hypothesis related to the

Iutionary relationship of organisms in the fonn of phylogenetic tree Then this project is not

laquomeemed with the reconstruction of phylogenetic trees Rather it was doing the classification of

phylogenetic trees image in pdf file or text file whether it is phylogenetic trees or nonshy

ylogenetic trees by using machine learning algorithm

10

CHAPTER TWO

LITERATURE REVIEW

As mentioned by Mounce (2012) recently there are millions of papers published each

at an ever growing rate about the phylogenetic tree This is because the amount and

mvllICImiddothI of species with at least partial sequence information was rapidly increasing Thus

phylogenetic trees become an integral part of various biological studies with the exponential

iDcrease of sequence data which is being generated by various classical and next generation

sequence studies (Baum D 2008) This chapter divides into few sections The first section

tbcuses on phylogenetic trees which explain more on the meaning and purpose for the

ylogenetic trees and types of phylogenetic trees The next section concentrates on the feature

mimage This section also emphasizes on the suitable features that were suitable used for image

ification process Besides this section reviewed on image recognition system frameworks as

nvaoSEeoletic Tree

Phylogenetic tree or evolution tree is an illustrative representation of biological entities

were associated with common descent such as species or higher-level taxonomic

___pmJ~ (Gregory 2008) Phylogenetic tree represents a backbone for various other biological (8aum 2008) Therefore it is a graphical representation of a hypothesis about the

_tlon of a species with branches that separated hybridized or terminated by extinction

readers can read and understand the patterns of descent from the phylogenetic trees

the phylogenetic trees do not indicate when species evolved or how much genetic

11

CD8ogeoccurred in a lineage (Kelly Grenyer amp Scotland 2014) This is because phylogenetic

should not be assumed that a taxon can be evolved from the taxon next to it

Baum Smitch and Donovan (2005) stated that phylogenetic tree is the most direct

itllltrgttilln of the principle of common ancestry This is because phylogenetic tree is very crucial

r evolutionary theory In fact they were trying to tell the readers that practical understanding

ofwhat phylogenetic tree represented is really important in understand the evolution relationship

( the species Thus the phylogenetic trees become important in the evolution analysis of any

species as the biologists should increase the use of phylogentic trees in biological sciences Next

ylogenetic trees provides an efficient structure for organizing biodiversity info Moreover it

elopes accurate conception of totality of evolutionary history Therefore it is important for

aspiring biologists to develop the understanding of phylogenetic trees

of Phylogenetic Tree

Phylogenetic trees can be divided into different kinds of trees There were two main

ories including the phylogenetic rooted trees and the phylogenetic unrooted trees Apart

the two main categories the phylogenetic tree can represent in several form slanted

iIIIiIIIWJrlm Figure 4 (Phylogenetic tree 2002) rectangular cladogram Figure 5 (Phylogenetic

2002) and circular cladogram Figure 6 (Phylogenetic tree 2002) Roots can be artificially

to unrooted trees by means of a species that had unambiguously separated early from

species being considered (Bacardit 2009)

12


LIST OF FIGURES

Figure 24: GIST Feature Detection and Extraction

Figure 25: 10-fold cross-validation accuracy

Figure 26: Example of Graphic User Interface for the Phylogenetic Tree Image Classification system

Figure 27: Graphic User Interface for the Phylogenetic Tree Image Classification system

ABSTRACT

This study develops an automated phylogenetic tree image classification system using a machine learning algorithm. It adopts a supervised machine learning algorithm, the Support Vector Machine (SVM), for classification. Image data were collected from the online databases PUBMED, ScienceDirect and Bioinformatics journals. The project compares the performance of three types of features for characterizing phylogenetic tree images, with the aim of determining which features suit the phylogenetic tree image classification system. Leave-one-out cross-validation was used to calculate the accuracy of each feature, and 10-fold cross-validation was also conducted in the evaluation. Our results show that the most suitable combination of features for the phylogenetic tree image classification system is SIFT, SURF and GIST. The combination of the three features achieves an accuracy of just over 82%, and the average accuracy obtained from the 10-fold cross-validation is 81.50%. These evaluation results demonstrate the utility of SIFT, SURF and GIST features for building a phylogenetic tree image classification system.

Keywords: phylogenetic tree, image classification system, image processing, feature extraction, SIFT, GIST, SURF
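The two evaluation schemes named in the abstract differ only in how the images are partitioned into training and test sets. The sketch below is a minimal illustration of that partitioning, not the thesis's implementation: the function names and the data set size of 20 images are assumptions made for the example, and no classifier is trained.

```python
from typing import List, Tuple


def k_fold_splits(n_samples: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, test_indices) pairs for k-fold cross-validation.

    Samples are assigned to folds in order; fold sizes differ by at most one.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    base, extra = divmod(n_samples, k)
    splits = []
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        splits.append((train, test))
        start += size
    return splits


def mean_accuracy(fold_accuracies: List[float]) -> float:
    """Average the per-fold accuracies into a single score."""
    return sum(fold_accuracies) / len(fold_accuracies)


# Hypothetical data set of 20 images (an assumption for illustration only).
ten_fold = k_fold_splits(20, 10)       # 10 folds, each holding out 2 images
leave_one_out = k_fold_splits(20, 20)  # leave-one-out is k-fold with k = n
```

In practice each training set would be used to fit the classifier and each test set to score it; the reported figure is then the mean of the per-fold accuracies.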


ABSTRAK (Abstract, translated from Malay)

A study was conducted to produce an automated classification system for phylogenetic tree images using a machine learning algorithm. The study used a supervised machine learning algorithm, namely the Support Vector Machine (SVM). Image data were collected from the online databases PUBMED, ScienceDirect and Bioinformatics. A performance comparison of three different types of phylogenetic tree features is also presented in this project. The aim is to determine the features suitable for the phylogenetic tree image classification system. Leave-one-out cross-validation was used to calculate the accuracy of each feature. In addition, 10-fold cross-validation was carried out in this study. The results show that the most suitable combination of features for the phylogenetic tree image classification system is SIFT, SURF and GIST. The accuracy obtained by combining the three features exceeds 82.19%. The results also show that the average accuracy obtained from 10-fold cross-validation is 81.50%. These results demonstrate the use of the combined SIFT, SURF and GIST features to implement the phylogenetic tree classification system.

Keywords: image classification system, image processing, feature extraction, SIFT, GIST, SURF

CHAPTER ONE

INTRODUCTION

Overview

Phylogenetic trees are widely used for the evolutionary analysis of different species, organisms or genes descended from a common ancestor (Laubach, von Haeseler & Lercher, 2012). According to Brinkman (2005), evolution analysis is a collection of methods for ascertaining long-term phenotypic evolution that was developed during the 1990s. Evolutionary analysis also rests on evolutionary theory, which is the foundation of most bioinformatic analysis. This is because evolutionary analysis shows the ecological characterization of a species using the concept of frequency dependence from gene theory (Brinkman, 2005). This chapter discusses the background of the study, the problem statement, the research objectives, the research questions, the hypothesis, the conceptual framework and the significance of the study. It also defines the relevant terms.

Introduction

The evolutionary tree, or phylogenetic tree, is a visualization that shows the relationships between entities according to the similarities and differences in their hereditary or physical characteristics (Baum, 2008). The way a phylogenetic tree shows the relationships among species is therefore important, because it is how the tree conveys the evolution analysis of any species. Evolution analysis generally includes the identification of homologous sequences, their multiple alignment, phylogenetic reconstruction, and the graphical representation of the inferred tree (Dereeper et al., 2008). According to Dereeper et al. (2008), the first step identifies similar sequences, whereas the alignment step determines where those sequences differ. Phylogenetic reconstruction builds the phylogenetic tree from the aligned sequences, and the graphical representation shows the relationship between the species in the tree (Dereeper et al., 2008). This reflects the increasing use of phylogenetic trees in the biological sciences, especially by biologists who carry out evolution analysis of species. The phylogenetic tree is therefore important for the evolution analysis of life on Earth.

Phylogeny is the evolutionary history of a species or group of related species (Pagel, 1999). It is closely tied to systematics, the discipline that classifies organisms (Siegel-Causey, Brooks & Funk, 1991), because systematists use phylogeny to determine the evolutionary relationships of organisms. According to Campbell and Reece (2008), a systematist is a professional who uses fossil, molecular and genetic data to infer evolutionary relationships. They also discuss the PhyloCode in connection with depicting phylogenetic analyses as branching phylogenetic trees. A phylogenetic analysis is presented as a collection of nodes and branches. Taxa that are closely related in an evolutionary sense appear close to each other, whereas distantly related taxa sit on different branches of the tree, far from each other.

Background of the study

In 1859, Darwin published the first illustration of a phylogenetic tree (Darwin, 1859). Before that, in 1837, shortly after his famous five-year voyage as a naturalist on the Beagle, he sketched a tree diagram in his notebook (Darwin, 1859). As Figure 1 shows, this simple sketch is remarkably similar to modern diagrams of phylogenies (Darwin, 1987).

Figure 1. The first evolutionary tree diagram, sketched by Darwin. Adapted from Charles Darwin's notebooks, 1836-1844: Geology, transmutation of species, metaphysical enquiries (p. 87), by Darwin, C., 1987, Cambridge: Cambridge University Press. Copyright 1987 by P. H. Barrett (Ed.). Adapted with permission.

Figure 2. Cumulative number of publications in the Science Citation Index since 1981 that cite the terms "molecular" and "phylogeny" in the keywords or abstract (horizontal axis: year, 1980 to 2000). Adapted from "Inferring the historical patterns of biological evolution" by Pagel, M., 1999, Nature, 401(6756), p. 844. Copyright 1999 by Pagel. Adapted with permission.

The first illustration of a phylogenetic tree accompanied the first scientific argument for the theory of evolution by means of natural selection. Darwin (1998) stated, "The time will come, I believe, though I shall not live to see it, when we shall have fairly true genealogical trees of each great kingdom of Nature" (p. 18). He would have welcomed seeing how modern genetics has supported and confirmed the ideas he set out (Darwin, 1859). Modern genetics provides evidence not only of what happened in the course of evolution but of precisely how living things evolve. That forensic evidence now comes from DNA, which was unknown in Darwin's time.

Several approaches were used to study the evolution of species before molecular phylogenetics (Campbell & Reece, 2008). In the 1900s, immunochemical studies were used to detect cross-reactions, which are stronger between closely related organisms. Between the 1940s and the 1960s, biologists adopted protein sequencing, electrophoresis and DNA hybridization; together with the later development of PCR, these methods contributed to a boom in molecular phylogeny. After Darwin published The Origin of Species, many other biologists came to accept the idea of a universal Tree of Life (Darwin, 1987). One German biologist who supported Darwin's Tree of Life was Ernst Haeckel (Larget, 2011). In the late 1970s, biologists began to study the evolution of organisms using molecular phylogeny. Phylogenetic trees are very useful to biologists because they describe the relations between living creatures, genomes and genes.

With the development of techniques for generating phylogenetic data, the number of studies depicting phylogenies has exploded (Pagel, 1999). The number of articles publishing phylogenies based on gene-sequence information has been increasing exponentially, as Figure 2 shows (Pagel, 1999). These phylogenies cover taxonomic groups ranging from viruses to bacteria, fungi, plants and animals (Campbell & Reece, 2008). The phylogenetic tree has thus become popular and important for the evolutionary analysis of organisms. It is a branching diagram that shows the evolutionary relationships of organisms (Baum, 2008). Following Darwin (1859), evolution refers to a natural process acting on populations: the change in the heritable traits of a biological population over successive generations.

A phylogeny also shows the similarities and differences in physical and hereditary traits, because taxa that join at a node are inferred to have descended from the ancestor that the node represents (Gregory, 2008). In this respect a phylogenetic tree is similar to a family tree. Phylogenetic trees are constructed from the similarities or differences in the physical or genetic features of organisms. In the past, scientists used only the traditional approach, which relied on physical features. Advances in technology have since led to the accumulation of huge amounts of biological data (Wan & Che, 2013), and this is changing the way biological studies are carried out in various respects.

As Wan and Che (2013) note, phylogenetic trees can be built from information on interacting pathways. They applied hierarchical clustering to two groups of organisms, eukaryotes and prokaryotes, and reported that interacting pathways can make it more effective to reveal the evolutionary relationships of species (Wan & Che, 2013). A phylogenetic tree is constructed from a variety of evidence, most commonly by comparing DNA (Kaizhong, Jason T. & Dennis, 1996). Formally, it is an undirected, acyclic, connected graph. The branch lengths represent the time since groups split from each other, the internal nodes of the tree represent ancestors, and the exterior nodes are called leaves.
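The graph description above can be made concrete with a short sketch. It checks the two defining properties of a tree (connected and acyclic) and lists the leaves of a small edge list. The function names, node labels and example tree are hypothetical and chosen only for illustration; branch lengths are left out.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def build_adjacency(edges: List[Edge]) -> Dict[str, Set[str]]:
    """Store the undirected graph as node -> set of neighbouring nodes."""
    adjacency: Dict[str, Set[str]] = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def is_tree(edges: List[Edge]) -> bool:
    """True if the graph is connected and acyclic (n nodes, n - 1 edges)."""
    adjacency = build_adjacency(edges)
    if not adjacency or len(edges) != len(adjacency) - 1:
        return False
    # Breadth-first search: a tree must reach every node from any start node.
    start = next(iter(adjacency))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(adjacency)


def leaves(edges: List[Edge]) -> List[str]:
    """Exterior nodes: those attached to exactly one branch."""
    adjacency = build_adjacency(edges)
    return sorted(node for node, nbrs in adjacency.items() if len(nbrs) == 1)


# Hypothetical tree: A, B and C are taxa; anc1 and anc2 stand for ancestors.
example_edges = [("anc1", "anc2"), ("anc2", "A"), ("anc2", "B"), ("anc1", "C")]
```

For this example, `is_tree(example_edges)` holds and `leaves(example_edges)` gives the three taxa; adding an edge between A and B creates a cycle, so the result is no longer a tree.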


Besides constructing phylogenetic trees, a newer approach extracts phylogenetic tree data from the published literature using content mining (Mounce, 2012). The term can be explained in two parts. Content can include anything, such as audio, video, metadata, text and images. Mining refers to extracting large amounts of information from that content. Extracting phylogenetic tree data from the literature is content mining rather than text mining, because the content is more than just text (Mounce, 2012).

In short, phylogenetic trees provide a framework that shows the evolution of features (Baum, 2008), and related species share many similar features. Phylogenetic trees are also used in bioprospecting, a strategy that exploits phylogenetic information to target closely related species when searching for a shared feature of interest (Kelly, Grenyer & Scotland, 2014). Phylogenetic trees are likewise useful in conservation evaluation, for choosing sets of species that maximize the present utilitarian benefits of extant feature diversity as well as the range of future evolutionary trajectories.

Problem Statement of the study

With the growing volume of publication databases, the number of published phylogenetic trees keeps increasing: with the rapid accumulation of DNA sequence data, more and more phylogenetic trees are being constructed (Pagel, 1999). This makes it challenging and time consuming for a researcher to search for relevant information (Dereeper et al., 2008). The published documents also contain various types of content, such as images, audio, art and tables. Search engines rely on the text or captions associated with a figure to perform a search. Classifying phylogenetic tree images one by one is therefore challenging and time consuming for the researcher, and difficulty in finding a particular phylogenetic tree may delay a biologist's research. Furthermore, phylogenetic trees were devised to study the evolution of organisms, and published trees are now mainly reused by biologists. An automated digitization application that searches for phylogenetic trees is therefore needed, because it can take over this laborious manual task and determine whether an image is a phylogenetic tree.

The main purpose of this project is therefore to automate the classification of phylogenetic tree images using a machine learning algorithm. The classification focuses on deciding whether images in a PDF file or text file are phylogenetic trees or non-phylogenetic trees. Examples of phylogenetic trees are the cladogram, the phenogram and tree terminology. Examples of non-phylogenetic trees are the family tree, the life cycle of organisms and the flow chart. Figure 3 shows one kind of non-phylogenetic tree, a family tree (Murdoch, 2013).

Figure 3. Non-phylogenetic tree: family tree. Adapted from Murdoch Family Tree, by Murdoch, W., 2013. Retrieved from httpwwwwilliammurdochnetJarticles_12_Murdoch_family_treehtml. Copyright 2013 by Murdoch. Adapted with permission.

General Objective: The main objective of this research study is to employ a machine learning algorithm that can classify images as phylogenetic trees or non-phylogenetic trees.

Specific Objective: The specific objectives of this study are:

i. To employ machine learning that can predict whether an image represents a phylogenetic tree.

ii. To compare and contrast the different features that represent a phylogenetic tree in an image.

Research Question

i. Can a neural network be used to predict phylogenetic tree images?

ii. What discriminative features can be used for classifier learning?

Definition of Terms

i. Phylogenetic tree, also called a phylogeny, is a branching diagram that illustrates the lines of evolutionary relationship of different species, organisms or genes from a common ancestor (Baum, 2008).

ii. Phylogeny is the evolutionary relationship between organisms (Baum, 2008).

iii. Evolution analysis is the foundation of phylogenetic trees, with cautionary notes (Brinkman, 2005).

iv. Content mining includes figure mining as a significant part, which deals with non-textual content (Mounce, 2014).

This research study hopes to advance knowledge on the automated digitization of phylogenetic tree images, classifying images from PDF or text files as phylogenetic trees or non-phylogenetic trees.

This research study is mainly focused on the rooted tree (cladogram) and the unrooted tree.

In conclusion, phylogenetics is the science of constructing hypotheses about the evolutionary relationships of organisms in the form of phylogenetic trees. This project is not concerned with the reconstruction of phylogenetic trees. Rather, it classifies images in PDF or text files as phylogenetic trees or non-phylogenetic trees using a machine learning algorithm.

CHAPTER TWO

LITERATURE REVIEW

As mentioned by Mounce (2012), millions of papers are now published each year at an ever-growing rate, including papers about phylogenetic trees. This is because the number and variety of species with at least partial sequence information is rapidly increasing. Phylogenetic trees have thus become an integral part of various biological studies, along with the exponential increase in sequence data generated by classical and next-generation sequencing studies (Baum, 2008). This chapter is divided into several sections. The first section focuses on phylogenetic trees, explaining their meaning and purpose and the types of phylogenetic tree. The next section concentrates on image features, with emphasis on the features suitable for the image classification process. It also reviews image recognition system frameworks.

Phylogenetic Tree

A phylogenetic tree, or evolutionary tree, is an illustrative representation of biological entities that are connected through common descent, such as species or higher-level taxonomic groupings (Gregory, 2008). The phylogenetic tree serves as a backbone for various other biological studies (Baum, 2008). It is a graphical representation of a hypothesis about the evolution of a species, with branches that have separated, hybridized or terminated by extinction. Readers can read and understand the patterns of descent from a phylogenetic tree. However, phylogenetic trees do not indicate when species evolved or how much genetic change occurred in a lineage (Kelly, Grenyer & Scotland, 2014), and it should not be assumed that a taxon evolved from the taxon next to it.

Baum, Smith and Donovan (2005) stated that the phylogenetic tree is the most direct representation of the principle of common ancestry, which is crucial to evolutionary theory. Their point is that a practical understanding of what a phylogenetic tree represents is important for understanding the evolutionary relationships of species. Phylogenetic trees have thus become important in the evolution analysis of any species, and biologists should make greater use of them in the biological sciences. Phylogenetic trees also provide an efficient structure for organizing biodiversity information and help develop an accurate conception of the totality of evolutionary history. It is therefore important for aspiring biologists to develop an understanding of phylogenetic trees.

Types of Phylogenetic Tree

Phylogenetic trees can be divided into different kinds. There are two main categories: rooted phylogenetic trees and unrooted phylogenetic trees. Apart from these two categories, a phylogenetic tree can be drawn in several forms: the slanted cladogram (Figure 4; Phylogenetic tree, 2002), the rectangular cladogram (Figure 5; Phylogenetic tree, 2002) and the circular cladogram (Figure 6; Phylogenetic tree, 2002). A root can be added artificially to an unrooted tree by means of an outgroup, a species that unambiguously separated early from the species being considered (Bacardit, 2009).
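Rooting an unrooted tree with an early-diverging species can be sketched as follows. This is a minimal illustration under stated assumptions, not a method taken from the thesis: the function name, the node labels and the new node name "root" are hypothetical, the early-diverging species must be a leaf, and no existing node may already be called "root".

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def root_with_outgroup(
    edges: List[Tuple[str, str]], outgroup: str, root_name: str = "root"
) -> Dict[str, List[str]]:
    """Root an unrooted tree on the branch that leads to the outgroup leaf.

    Returns a dict mapping each parent node to the sorted list of its children.
    """
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    if len(adjacency[outgroup]) != 1:
        raise ValueError("the outgroup must be a leaf of the tree")

    # Split the outgroup branch by inserting the new root node on it.
    (neighbour,) = adjacency[outgroup]
    adjacency[outgroup].remove(neighbour)
    adjacency[neighbour].remove(outgroup)
    adjacency[root_name] = {outgroup, neighbour}
    adjacency[outgroup].add(root_name)
    adjacency[neighbour].add(root_name)

    # Orient every branch away from the root with a depth-first traversal.
    children: Dict[str, List[str]] = {}
    stack = [(root_name, None)]
    while stack:
        node, parent = stack.pop()
        kids = sorted(n for n in adjacency[node] if n != parent)
        if kids:
            children[node] = kids
        stack.extend((kid, node) for kid in kids)
    return children


# Hypothetical unrooted tree: taxa A, B, C and outgroup O; u and v are internal nodes.
unrooted = [("u", "A"), ("u", "B"), ("u", "v"), ("v", "C"), ("v", "O")]
rooted = root_with_outgroup(unrooted, "O")
```

Here `rooted` places O on one side of the new root and the remaining taxa on the other, which is what makes the direction of descent readable from the drawing.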


Page 11: Faculty of Cognitive Sciences and Human Development Tree Classification... · Figure 4: phylogenetic rooted-tree: rectangular cladogram ..... 13 Figure 5: phylogenetic rooted-tree:

ABSTRACT

A study is conducted to develop an automated phylogenetic tree image classification system by

using machine learning algorithm This study adopted supervised machine learning algorithm

which is the Support Vector Machine (SVM) for classification Image data were collected from

online databases PUBMED ScienceDirect and Bioinfonnatic journals Perfonnance

comparisons of three types of features to characterize the phylogenetic tree images are presented

in this project The aim is to detennine the suitable features for the phylogenetic tree image

classification systeIlJ The leave-out one cross-validation was used to calculate the accuracy of

each feature In addition to that 10-fold cross-validation is also conducted in the evaluation Our

results show that the suitable combination features for the phylogenetic tree image classification

system are SIFT SURF and GIST The accuracy obtained from these combinations of the three

features can achieve just over 82 On the other hands the results show the average accuracy

obtained from the 10-fold cross-validation is 8150 Our evaluation results demonstrate the

utility of using SIFT SURF and GIST features for building phylogenetic tree image

classification system

Keywords phylogenetic tree image classification system image processing feature extraction

SIFT GIST SURF

VIII

ABSTRAK

Sebuah kajian telah dijalankan untuk meghasilkan sistem pengelasan automatik imej pokok

filogenetik dengan menggunakan algoritma mesin pembelajaran Kajian tersebut telah

menggunakan pembelajaran algoritma mesin diselia iaitu Mesin Vektor Sokongan (SVM) Data

imej telah dikumpulkan dari pangkalan data dalam talian PUBMED ScienceDirect dan

Bioinformatik Perbandingan antara prestasi tiga ciri-ciri pokokfilogenetik yang berbeza juga

telah ditunjukkan dalam projek ini Tujuannya adalah untuk menentukan ciri-ciri yang sesuai

untuk sistem klasifikasi pokok imej filogenetik Satu pengesahan cuti keluar salib telah

digunakan untuk mengira ketepatan bagi setiap ciri Tambahan pula 10 kali ganda silang

pengesahan akan diukurkan dalam kajian ini Hasil kajian ini telah menunjukkan bahawa cirishy

cjri gabungan yang paling sesuai bagi imej sistem klasifikasi pokokfilogenetik adalah SIFT

SURF dan GIST Ketepatan yang diperolehi daripada tiga ciri-ciri melalui gabungan boleh

memperolehi lebih daripada 8219 Selain itu hasilnya juga menunjukkan ketepatan purata

yang diperolehi daripada 10 kali ganda silang pengesahan iaitu sebanyak 8150 Hasil kajian

ini menunjukkan gabungan ciri ciri SIFT SURF dan GIST untuk melaksanakan sistem

filogenetik klasifikasi pokok ini

Kata Kunci sistem klasifikasi imej pemprosesan imej pengekstrakan ciri SIFT GIST SURF

IX

CHAPTER ONE

INTRODUCTION

Overview

It is an undeniable fact that the phylogenetic trees are diffusely used for evolutionary

analysis of different species organisms or genes from a collaborative ancestor (Laubach von

Haeseler amp Lercher 2012) According to the Brinkman (2005) evolution analysis is a collection

of expedients for ascertainment long-term phenotypical evolution which developed during the

year of 1990s Evolutionary analysis also refers to foundation of most bioinformatic analysis

which is evolution theory This is because the evolutionary analysis shows the ecological

characterization of the species that uses the concept of frequency dependence from gene theory

(Brinkman 2005) This chapter mainly discusses about the background of the study problem

statements research objectives research questions hypothesis and conceptual framework of the

study and significance of the study In addition this chapter also describes the definition of

relevant terms

Introduction

An evolutionary tree, or phylogenetic tree, is a diagram that shows the relationships among entities according to the similarities and differences in their hereditary or physical characteristics (Baum, 2008). How a phylogenetic tree depicts the relationships among species therefore matters, because the tree is the means by which the evolutionary analysis of any species is presented. Evolutionary analysis generally includes the identification of homologous sequences, their multiple alignment, phylogenetic reconstruction and the graphical representation of the inferred tree (Dereeper et al., 2008). According to Dereeper et al. (2008), identifying homologous sequences finds the sequences that are similar, whereas multiple alignment determines where the aligned sequences differ. Phylogenetic reconstruction then builds the phylogenetic tree from the aligned sequences, and the graphical representation shows the relationships among the species in that tree (Dereeper et al., 2008). This workflow reflects the increasing use of phylogenetic trees in the biological sciences, especially among biologists who analyse the evolution of species. Phylogenetic trees are therefore important for the evolutionary analysis of life on Earth.

A phylogeny is the evolutionary history of a species or group of related species (Pagel, 1999). Phylogeny is central to systematics, the discipline that classifies organisms (Siegel-Causey, Brooks, & Funk, 1991), because systematists use phylogenies to determine the evolutionary relationships of organisms. Following Campbell and Reece (2008), the term systematist in this research refers to a professional who uses fossil, molecular and genetic data to infer evolutionary relationships. They also describe PhyloCode, a proposed system that names groups according to the branching pattern of phylogenetic trees. A phylogenetic analysis is presented as a collection of nodes and branches. Taxa that are closely related in an evolutionary sense appear close to each other, whereas distantly related taxa sit on different branches, far apart in the tree.

Background of the study

In 1859, Darwin published the first illustration of a phylogenetic tree (Darwin, 1859). Earlier, in 1837, shortly after his famous five-year voyage as naturalist on the Beagle, he had sketched a tree diagram in his notebook (Darwin, 1859). As Figure 1 shows, this simple sketch is remarkably similar to modern diagrams of phylogenies (Darwin, 1987).

Figure 1. The first evolutionary tree diagram sketched by Darwin. Adapted from Charles Darwin's notebooks, 1836-1844: Geology, transmutation of species, metaphysical enquiries (p. 87), by Darwin, C., 1987, Cambridge: Cambridge University Press. Copyright 1987 by P. H. Barrett (Ed.). Adapted with permission.

Figure 2. Cumulative number of publications in the Science Citation Index since 1981 that cite the terms "molecular" and "phylogeny" in the keywords or abstract, plotted by year (1980 to 2000). Adapted from "Inferring the historical patterns of biological evolution," by Pagel, M., 1999, Nature, 401(6756), p. 844. Copyright 1999 by Pagel. Adapted with permission.

The first illustration of a phylogenetic tree accompanied the first scientific argument for the theory of evolution by means of natural selection. Darwin (1998) stated that "The time will come, I believe, though I shall not live to see it, when we shall have fairly true genealogical trees of each great kingdom of Nature" (p. 18). He would no doubt have been eager to see how modern genetics supports and confirms his ideas. Darwin provided evidence not only for what had happened in evolution but also for how living things evolve (Darwin, 1859); today DNA, which was unknown in his time, supplies the forensic evidence for evolution.

Several approaches were used to study the evolution of species before molecular phylogenetics (Campbell & Reece, 2008). In the 1900s, immunochemical studies showed that cross-reactions are stronger between closely related organisms. Between the 1940s and the 1960s, biologists adopted protein sequencing and electrophoresis, followed later by DNA hybridization and PCR, all of which contributed to a boom in molecular phylogeny. In the late 1970s, biologists began to analyse the evolution of organisms using molecular phylogeny. Meanwhile, after Darwin published The Origin of Species, many other biologists had accepted the idea of a universal Tree of Life (Darwin, 1987); the German biologist Ernst Haeckel was one prominent supporter (Larget, 2011). Phylogenetic trees are useful to biologists because they describe the relationships among living creatures, genomes and genes.

With the development of techniques for handling phylogenetic data, the number of studies depicting phylogenies has exploded (Pagel, 1999). The number of articles publishing phylogenies based on gene-sequence information has been increasing exponentially, as Figure 2 shows (Pagel, 1999). These phylogenies cover taxonomic groups ranging from viruses to bacteria, fungi, plants and animals (Campbell & Reece, 2008). The phylogenetic tree has thus become popular and important for the evolutionary analysis of organisms. A phylogenetic tree is a branching diagram that shows the evolutionary relationships of organisms (Baum, 2008). Following Darwin (1859), evolution is a natural process acting on populations: the change in the heritable traits of a biological population over successive generations.

A phylogeny also shows similarities and differences in physical and hereditary traits, because taxa that are joined together in the tree are inferred to descend from the node they share (Gregory, 2008). In this respect a phylogenetic tree resembles a family tree. Phylogenetic trees are constructed from the similarities and differences in the physical or genetic features of organisms. Traditionally, scientists relied only on physical features to construct them. Advances in technology have since led to the accumulation of huge amounts of biological data (Wan & Che, 2013), which is changing many aspects of biological research.

Wan and Che (2013) showed that phylogenetic trees can be built from information on interacting pathways. They applied hierarchical clustering to two groups of organisms, eukaryotes and prokaryotes, and reported that using interacting pathways makes it more effective to reveal the evolutionary relationships among species (Wan & Che, 2013). More generally, phylogenetic trees are constructed from a variety of evidence, most commonly comparisons of DNA (Kaizhong, Jason T., & Dennis, 1996). Formally, a phylogenetic tree is an undirected, acyclic, connected graph. Branch lengths represent the time since groups split from each other, the internal nodes of the tree represent ancestors, and the exterior nodes are called leaves.
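The graph definition above can be checked mechanically. The following is a minimal sketch, not part of the original study, that tests whether a set of edges forms an undirected, acyclic, connected graph and lists its leaves. The function names and the taxon labels are hypothetical and chosen only for illustration; the sketch assumes an unweighted graph, so branch lengths are ignored.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

Graph = Dict[str, Set[str]]


def build_graph(edges: Iterable[Tuple[str, str]]) -> Graph:
    """Build an undirected adjacency map; repeated edges collapse into one."""
    graph: Graph = {}
    for a, b in edges:
        if a == b:
            raise ValueError("self-loops are not allowed")
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def is_tree(graph: Graph) -> bool:
    """A graph is a tree if it has n - 1 edges and is connected."""
    if not graph:
        return False
    n_edges = sum(len(nbrs) for nbrs in graph.values()) // 2
    if n_edges != len(graph) - 1:
        return False
    # Breadth-first search from any node must reach every node.
    start = next(iter(graph))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return len(seen) == len(graph)


def leaves(graph: Graph) -> List[str]:
    """Exterior nodes are those with exactly one neighbour."""
    return sorted(node for node, nbrs in graph.items() if len(nbrs) == 1)


# Hypothetical four-taxon tree with two internal (ancestor) nodes, x and y.
example_edges = [
    ("human", "x"), ("chimp", "x"), ("x", "y"),
    ("mouse", "y"), ("chicken", "y"),
]
tree = build_graph(example_edges)
print(is_tree(tree), leaves(tree))
```

Adding an edge between two existing leaves creates a cycle, so `is_tree` then returns `False`; this is the property that separates a tree from a general network.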

Besides constructing new trees, a more recent approach extracts phylogenetic tree data from the published literature by means of content mining (Mounce, 2012). The term has two parts. Content can be anything, such as audio, video, metadata, text and images. Mining refers to the large-scale extraction of information from that content. Extracting phylogenetic tree data from the literature calls for content mining rather than text mining alone, because the content is more than just text (Mounce, 2012).

In short, phylogenetic trees provide a framework for tracing the evolution of features (Baum, 2008): related species share many similar features. Phylogenetic trees are also used in bioprospecting, a strategy that exploits phylogenetic information to target closely related species when searching for a shared feature of interest (Kelly, Grenyer, & Scotland, 2014). For the same reason, phylogenetic trees are useful in conservation evaluation, where they help choose sets of species that maximize both the present utilitarian benefits of extant feature diversity and the range of future evolutionary trajectories.

Problem Statement of the study

As publication databases grow, so does the number of published phylogenetic trees: with the rapid accumulation of DNA sequence data, more and more trees are being constructed (Pagel, 1999). Searching for relevant information has therefore become challenging and time-consuming for researchers (Dereeper et al., 2008). The published documents also contain many types of content, such as images, audio, art and tables. Search engines rely on text, often the caption associated with a figure, to perform a search, so a researcher must still inspect and classify candidate images one by one, which is tedious and slow. When finding a particular phylogenetic tree is this difficult, research work is delayed. Furthermore, phylogenetic trees were originally drawn to study the evolution of organisms, whereas biologists today mainly want to reuse trees that have already been published. An automated application that finds phylogenetic trees for them is therefore needed: it can take over this laborious manual task by determining whether an image is a phylogenetic tree.

The main purpose of this project is therefore to automate the classification of phylogenetic tree images using a machine learning algorithm. The classifier decides whether an image taken from a PDF or text file is a phylogenetic tree or a non-phylogenetic tree. Examples of phylogenetic trees are cladograms, phenograms and tree-terminology diagrams. Examples of non-phylogenetic trees are family trees, life cycles of organisms and flow charts. Figure 3 shows one kind of non-phylogenetic tree, a family tree (Murdoch, 2013).

Figure 3. Non-phylogenetic tree: family tree. Adapted from Murdoch Family Tree, by Murdoch, W., 2013. Retrieved from httpwwwwilliammurdochnetJarticles_12_Murdoch_family_treehtml. Copyright 2013 by Murdoch. Adapted with permission.

General Objective

The main objective of this study is to employ a machine learning algorithm that classifies images as phylogenetic trees or non-phylogenetic trees.

Specific Objectives

The specific objectives of this study are:

i. To employ machine learning to predict whether an image represents a phylogenetic tree.

ii. To compare and contrast the different features that represent a phylogenetic tree in an image.

Research Questions

i. Can a neural network be used to predict phylogenetic tree images?

ii. Which discriminative features can be used for classifier learning?

Definition of Terms

i. A phylogenetic tree, also called a phylogeny, is a branching diagram that illustrates the lines of evolutionary relationship of different species, organisms or genes from a common ancestor (Baum, 2008).

ii. Phylogeny is the evolutionary relationship between organisms (Baum, 2008).

iii. Evolutionary analysis is the foundation on which phylogenetic trees are built, and its results must be interpreted with caution (Brinkman, 2005).

iv. Content mining includes, as a significant part, figure mining, which targets non-textual content (Mounce, 2014).

This study hopes to advance knowledge on automatically classifying images from PDF or text files as phylogenetic trees or non-phylogenetic trees.

This study focuses mainly on the rooted tree (cladogram) and the unrooted tree.

In conclusion, phylogenetics is the science of constructing hypotheses about the evolutionary relationships of organisms in the form of phylogenetic trees. This project is not concerned with the reconstruction of phylogenetic trees. Rather, it classifies images from PDF or text files as phylogenetic trees or non-phylogenetic trees using a machine learning algorithm.

CHAPTER TWO

LITERATURE REVIEW

As Mounce (2012) notes, millions of papers about phylogenetic trees are now published each year, at an ever-growing rate, because the number and diversity of species with at least partial sequence information is increasing rapidly. With the exponential increase in sequence data generated by classical and next-generation sequencing studies, phylogenetic trees have become an integral part of many biological studies (Baum, 2008). This chapter is divided into several sections. The first section covers phylogenetic trees: their meaning, their purpose and their types. The next section concentrates on image features, with emphasis on the features suitable for image classification, and also reviews image recognition system frameworks.

Phylogenetic Tree

A phylogenetic tree, or evolutionary tree, is a diagrammatic representation of biological entities that are connected through common descent, such as species or higher-level taxonomic groupings (Gregory, 2008). It provides a backbone for various other biological studies (Baum, 2008). A phylogenetic tree is thus a graphical representation of a hypothesis about the evolution of a species, with branches that separate, hybridize or terminate in extinction. Readers can read and understand patterns of descent from a phylogenetic tree. However, a phylogenetic tree does not indicate when species evolved or how much genetic change occurred in a lineage (Kelly, Grenyer, & Scotland, 2014), and it should not be assumed that a taxon evolved from the taxon drawn next to it.

Baum, Smith and Donovan (2005) stated that the phylogenetic tree is the most direct representation of the principle of common ancestry, which is crucial to evolutionary theory. They stress that a practical understanding of what a phylogenetic tree represents is essential for understanding the evolutionary relationships of species. Phylogenetic trees are therefore important in the evolutionary analysis of any species, and their use in the biological sciences should increase. Phylogenetic trees also provide an efficient structure for organizing biodiversity information and help develop an accurate conception of the totality of evolutionary history. It is therefore important for aspiring biologists to develop an understanding of phylogenetic trees.

Types of Phylogenetic Tree

Phylogenetic trees fall into two main categories: rooted trees and unrooted trees. Within these categories, a tree can be drawn in several forms: the slanted cladogram (Figure 4; Phylogenetic tree, 2002), the rectangular cladogram (Figure 5; Phylogenetic tree, 2002) and the circular cladogram (Figure 6; Phylogenetic tree, 2002). A root can be added artificially to an unrooted tree by means of an outgroup, a species that unambiguously separated early from the other species being considered (Bacardit, 2009).
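Rooting by an early-diverging species can be illustrated with a small example. The sketch below is an assumption-laden illustration rather than a method taken from the cited sources: it takes an unrooted tree stored as an adjacency map, places the root on the branch leading to the chosen species, and writes the result as a Newick-style string. The function name `root_with_outgroup`, the taxon labels and the internal node names `a` and `b` are all hypothetical.

```python
from typing import Dict, Set

Graph = Dict[str, Set[str]]


def root_with_outgroup(graph: Graph, outgroup: str) -> str:
    """Root an unrooted tree on the branch that leads to the outgroup leaf.

    Returns a Newick-style string in which the outgroup is one child of the
    root and the rest of the tree is the other child. Internal node names
    are dropped and children are sorted for a reproducible output.
    """
    if len(graph.get(outgroup, ())) != 1:
        raise ValueError("outgroup must be a leaf of the tree")
    (neighbor,) = graph[outgroup]

    def subtree(node: str, parent: str) -> str:
        # Walk away from the root so that every edge becomes parent -> child.
        children = sorted(n for n in graph[node] if n != parent)
        if not children:
            return node
        return "(" + ",".join(subtree(child, node) for child in children) + ")"

    return "(" + outgroup + "," + subtree(neighbor, outgroup) + ");"


# Hypothetical unrooted tree: a and b are unnamed internal nodes.
unrooted: Graph = {
    "human": {"a"},
    "chimp": {"a"},
    "a": {"human", "chimp", "b"},
    "mouse": {"b"},
    "chicken": {"b"},
    "b": {"a", "mouse", "chicken"},
}
print(root_with_outgroup(unrooted, "chicken"))
```

The same unrooted tree yields a different rooted tree for each choice of species, which is why the chosen species must be one known to have diverged before all the others.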



cjri gabungan yang paling sesuai bagi imej sistem klasifikasi pokokfilogenetik adalah SIFT

SURF dan GIST Ketepatan yang diperolehi daripada tiga ciri-ciri melalui gabungan boleh

memperolehi lebih daripada 8219 Selain itu hasilnya juga menunjukkan ketepatan purata

yang diperolehi daripada 10 kali ganda silang pengesahan iaitu sebanyak 8150 Hasil kajian

ini menunjukkan gabungan ciri ciri SIFT SURF dan GIST untuk melaksanakan sistem

filogenetik klasifikasi pokok ini

Kata Kunci sistem klasifikasi imej pemprosesan imej pengekstrakan ciri SIFT GIST SURF

IX

CHAPTER ONE

INTRODUCTION

Overview

It is an undeniable fact that the phylogenetic trees are diffusely used for evolutionary

analysis of different species organisms or genes from a collaborative ancestor (Laubach von

Haeseler amp Lercher 2012) According to the Brinkman (2005) evolution analysis is a collection

of expedients for ascertainment long-term phenotypical evolution which developed during the

year of 1990s Evolutionary analysis also refers to foundation of most bioinformatic analysis

which is evolution theory This is because the evolutionary analysis shows the ecological

characterization of the species that uses the concept of frequency dependence from gene theory

(Brinkman 2005) This chapter mainly discusses about the background of the study problem

statements research objectives research questions hypothesis and conceptual framework of the

study and significance of the study In addition this chapter also describes the definition of

relevant terms

Introduction

The evolutionary tree or phylogenetic tree is a visualization to show the relationship

between all entities according to the similarities and differences in their hereditary or physical

characteristics (Baum 2008) Therefore the way of phylogenetic tree shows the relationship

among the species was also important This can be reflectedby the way of phylogenetic tree to

demonstrate the evolution analysis of any species in this world Evolution analysis generally

iocludes the identification of analogous sequence diverse calibration phylogenetic rebuilding

and graphic representation or figure signification of the inferred tree (Dereeper et aI 2008)

Jbcse four terms can be explained through the biology evolutions According to Dereeper et ai

(2008) the analogous sequence is used to identify the similar sequence whereas the diverse

calibration is used to determine the difference of alignment Besides the phylogenetic rebuilding

is the process to build up the phylogenetic tree after the analogous seqence and the diverse

calibration process and then for the graphic representation or figure signification is used to show

the relationship between each species in the phylogenetic tree (Dereeper et aI 2008) This can

show that the increasing use of phylogenetic trees in biological sciences especially for biologists

who did the evolution analysis on the species Therefore the use of phylogentic tree is quite

important for the evolution analysis of life on Earth

Apart from that phylogeny is the evolutionary history of a species or group of related

species (Pagel 1999) The phylogeny can be called as the discipline of systematic classifies

organisms (Siegel-Causey Brooks amp Funk 1991) This is because phylogeny can be used to

determine organisms evolutionary relationship by systematist According to Campbell and Reece

(2008) the term systematist in this research refers to the professional who used fossil molecular

and genetic data to infer evolutionary relationships They also proposed PhyloCode which can

be used to depict the phylogenetic analysis in branching phylogenetic trees A phylogenetic

analysis presents as a collection of nodes and branch For instance the taxa that closely related

are in an evolutionary sense apppeared closely to each other whereas the taxa that distantly

related are in the different branches of the tree or there is a distance which is far from each other

in such tree

Background of the study

In the year of 1859 Darwin invented the first illustration of a phylogenetic tree (Darwin

1859) Before that shortly after his famous five years voyage as naturalist on Beagle in the year

2

2000

1000

of 1837 he sketched a tree diagram in his notebook (Darwin 1859) Based on the Figure I the

simple sketch was remarkably similar to modem diagrams of phylogenies (Darwin 1987)

9L-shy ~ ~ A 2$ ~laquo

~ r amp4 ~ lt- C ~ 7S _ ~ ~r p--~ -$ - 2gt

-z-a ~ ltZ- ~~-

~L-- F bull - L~ -~---r~ - - ~-------r rd 4=shy

Figure 1 The first evolution tree diagram sketched by Darwin Adapted from Charles Darwins

notebooks 1836-1844 Geology transmutation ospecies metaphysical enquiries (p 87) by Druwin c 1987 Cambridge Cambridge University Press Copyright 1987 by the P H Barrett (Ed) Adapted with pennission

o-l-lr=It=I-=-~=-lJ -------_ 1980 1985 1990 1995 2000

Year Figure 2 Cumulative number of publications in Science Citation index since 1981 that cite the tenn molecular and phylogeny in the keywords or abstract Adapted from Inferring the historical patterns of biological evolution by Pagel M 1999 Nature 401(6756) p 844 Copyright 1999 by the Pagel Adapted with pennission

3

First illustration of a phylogenetic tree is the first scientific argument for the theory of

advancement by means of innate selection Darwin (1998) stated that The time will come I

believe though I shall not live to see it when we shall have fairly true genealogical trees ofeach

great kingdom ofNature (p 18) In fact he mentioned that he would have the willingness to see

how modem genetics supported and confirmed by his owns ideas He provided evidence which

is not only for what had happened in the aspect of evolution but precisely how living things

evolve The forensic evidence he used for evolution was the DNA (Darwin 1859)

In fact there are few approaches used for discovering the evolution analysis of species

before the molecular phylogenetic (Campbell amp Reece 2008) In the year of 1990s

immunochemical studies were used to discover cross-reactions that stronger for closely related

organism Next in between the year of 1940s until 1960s biologists used the protein sequencing

method electrophoresis DNA hybridization and PCR that contributed to a boom in molecular

phylogeny On the other hand after publication of The Origin ofSpecies by Darwin many other

biologists came and accepted the truth of a universal Tree of Life (Darwin 1987) Then in the

late of 1970s biologists started to discover evolutionary analysis of organisms by using

molecular phylogeny One of the examples of experts from German biologists who supported

Darwins Tree of Life was the Ernst Haeckel (Larget 2011) It is very useful of using

phylogenetic trees for biologists because they can use them to describe the relations between

living creatures genomes atd genes

With the development of phylogenetic data technique there are the numbers of studies

depicting phylogenetic exploded (Pagel 1999) The number of articles publishing phylogenies

based on gene-sequence information has been increasing exponentially Figure 2 shows the data

aoalysis by using the phylogenetic tree (Pagel 1999) The phylogenies taxonomic group ranging

4

Pu~at Khidmat MaklulDlt Akademillt UN1VERSm MALAYSIA SUAWA)(

from viruses to bacteria fungi plans and animals (Campbell amp Reece 2008) Thus the

phylogenetic tree becomes popular and important for the evolutionary analysis of organisms

nowadays The phylogenetic tree is a branching diagram that shows the evolutionary relationship

of the organisms (Baum D 2008) Based on Darwin (1859) evolution refers to a natural

procedure to infer about the populations It can be described as the platfonn to show the

transformation in the hereditary traits of biological population over continuous generation

On the other hand phylogeny can show the similarities and differences in physical and

hereditary traits This is because there are the taxa that can attach together in the affinnation

which indicated to posse descendant from a node (Gregory 2008) Thus phylogenetic tree can

be concluded that it was similar to a family tree Moreover the construction of phylogenetic

trees is based on the similarities or differences of their physical or genetic features Few years

ago the scientists only used the tradition way which only focused on physical features of

constructing phylogenetic trees Luckily the advancement of high technologies has been led to

accumulation of huge amounts of biological data (Wan amp Che 2013) This may lead to the

changing towards the way of biological studies in various aspects

As mentioned by Wan and Che (2013) building phylogenetic trees can use the

information of interacting pathways They did apply the hierarchical clustering on two domains

of organisms which were eukaryotes and prokaryotes Using interacting pathway can increase

the effectiveness on revearing evolutionary relationships ofthe species (Wan amp Che 2013)

Phylogenetic tree was constructed using variety evidence such as generally comparing DNA

(Kaizhong Jason T amp Dennis 1996) It was an undirected acyclic connected graph Basically

the lengths of branches represented time since the groups split from each other and the node for

he tree is known as ancestors The set of exterior nodes are called leaves

5

Apart from constructing the phylogenetic tree the new approach nowadays can extract

the phylogenetic tree data from the literacture review In fact it is using the content mining to

extract the data from the literature review (Mounce 2012) Content mining can be split into

content and mining in explanation Content can be included anything such as the audio video

metadata text and image Besides the mining shows the huge number of data information

extraction from the content Extracting phylogenetic tree data from literacture review uses more

content mining than text mining because the content was more than just text (Mounce 2012)

In short phylogenetic trees provides a framework that shows the evolution of features

(Baum D 2008) This shows that the related species shared in many common of similar

features Next the phylogenetic trees also uses in bio-prospecting which is an optimal strategy

that exploited phylogenetic information to target closely related species to search for shared

feature of interest (Kelly Grenyer amp Scotland 2014) This shows that related species can search

for shared features in common Therefore the phylogenetic trees are useful for conservation

evaluation in choosing sets of species that can maximized the present utilitarian benefits of

extant feature diversity as well as the range of evolutionary trajectories in the future

Problem Statement of the study

With the increase volume of publication databases volume of the phylogenetic trees is

getting bigger It is because with the rapid accumulation of DNA sequence data more and more ~

phylogenetic trees are being constructed (Pagel 1999) It is technically leads to challenge and

time consuming for a researcher to search for relevant information (Dereeper et aI 2008) Next

the types of contents in these published documents are various such as images audio arts and

tables Search engines rely on texts or captions are often associated with a figure to perform a

search This makes the classification of the phylogenetic trees image one by one by the

6

researcher becoming challenging and waste of time Moreover if the biologist becomes

challenging and time consuming when searching for the particular phylogenetic tree this may

delay their research works Furtermore the purpose for the invented phylogentic trees is to study

the evolution analysis of the organisms Nowadays the presented phylogenetic tree mainly is

used to reuse purpose for those biologists Therefore the use of automated digization application

to search the phylogenetic trees for them is truthly needed It is because this can replace the very

challenging task of human works and determine whether an image is a phylogenetic tree


CHAPTER ONE

INTRODUCTION

Overview

Phylogenetic trees are widely used for the evolutionary analysis of different species, organisms, or genes that descend from a common ancestor (Laubach, von Haeseler, & Lercher, 2012). According to Brinkman (2005), evolution analysis is a collection of methods for ascertaining long-term phenotypic evolution that was developed during the 1990s. Evolutionary analysis also rests on evolutionary theory, which is the foundation of most bioinformatic analysis. This is because evolutionary analysis gives an ecological characterization of a species using the concept of frequency dependence from gene theory (Brinkman, 2005). This chapter mainly discusses the background of the study, the problem statement, the research objectives, the research questions, the hypothesis and conceptual framework of the study, and the significance of the study. In addition, this chapter defines the relevant terms.

Introduction

The evolutionary tree, or phylogenetic tree, is a visualization that shows the relationships among entities according to the similarities and differences in their hereditary or physical characteristics (Baum, 2008). The way a phylogenetic tree shows the relationships among species is therefore also important, as reflected in how phylogenetic trees are used to demonstrate the evolution analysis of any species. Evolution analysis generally includes the identification of homologous sequences, their multiple alignment, phylogenetic reconstruction, and the graphical representation of the inferred tree (Dereeper et al., 2008). These four terms can be explained in terms of biological evolution. According to Dereeper et al. (2008), the identification of homologous sequences finds similar sequences, whereas multiple alignment determines the differences between the aligned sequences. Phylogenetic reconstruction is the process of building the phylogenetic tree after the sequence identification and alignment steps, and the graphical representation shows the relationships between the species in the phylogenetic tree (Dereeper et al., 2008). This reflects the increasing use of phylogenetic trees in the biological sciences, especially by biologists who carry out evolution analysis of species. Therefore, the phylogenetic tree is important for the evolution analysis of life on Earth.

Apart from that, phylogeny is the evolutionary history of a species or group of related species (Pagel, 1999). Phylogeny is studied within systematics, the discipline that classifies organisms (Siegel-Causey, Brooks, & Funk, 1991), because systematists use phylogeny to determine the evolutionary relationships of organisms. According to Campbell and Reece (2008), the term systematist refers to professionals who use fossil, molecular, and genetic data to infer evolutionary relationships. They also proposed PhyloCode, which can be used to depict a phylogenetic analysis as a branching phylogenetic tree. A phylogenetic analysis is presented as a collection of nodes and branches. For instance, taxa that are closely related in an evolutionary sense appear close to each other, whereas taxa that are distantly related sit on different branches of the tree, far from each other.

Background of the study

In 1859, Darwin presented the first illustration of a phylogenetic tree (Darwin, 1859). Before that, in 1837, shortly after his famous five-year voyage as a naturalist on the Beagle, he sketched a tree diagram in his notebook (Darwin, 1859). As Figure 1 shows, this simple sketch was remarkably similar to modern diagrams of phylogenies (Darwin, 1987).

Figure 1. The first evolution tree diagram sketched by Darwin. Adapted from Charles Darwin's notebooks, 1836-1844: Geology, transmutation of species, metaphysical enquiries (p. 87), by Darwin, C., 1987, Cambridge: Cambridge University Press. Copyright 1987 by P. H. Barrett (Ed.). Adapted with permission.

Figure 2. Cumulative number of publications in the Science Citation Index since 1981 that cite the terms molecular and phylogeny in the keywords or abstract (horizontal axis: year, 1980 to 2000). Adapted from Inferring the historical patterns of biological evolution, by Pagel, M., 1999, Nature, 401(6756), p. 844. Copyright 1999 by Pagel. Adapted with permission.

The first illustration of a phylogenetic tree accompanied the first scientific argument for the theory of evolution by means of natural selection. Darwin (1998) stated, "The time will come, I believe, though I shall not live to see it, when we shall have fairly true genealogical trees of each great kingdom of Nature" (p. 18). In fact, he indicated that he would have been willing to see how modern genetics supported and confirmed his own ideas. He provided evidence not only of what had happened in the course of evolution but also of precisely how living things evolve. The forensic evidence he used for evolution was DNA (Darwin, 1859).

In fact, several approaches were used to study the evolution of species before molecular phylogenetics (Campbell & Reece, 2008). In the 1990s, immunochemical studies were used to detect cross-reactions, which are stronger for closely related organisms. Next, between the 1940s and the 1960s, biologists used protein sequencing, electrophoresis, DNA hybridization, and PCR, which contributed to a boom in molecular phylogeny. Meanwhile, after Darwin published The Origin of Species, many other biologists came to accept the idea of a universal Tree of Life (Darwin, 1987). Then, in the late 1970s, biologists started to carry out evolutionary analysis of organisms using molecular phylogeny. One German biologist who supported Darwin's Tree of Life was Ernst Haeckel (Larget, 2011). Phylogenetic trees are very useful to biologists because they can be used to describe the relationships between living creatures, genomes, and genes.

With the development of techniques for phylogenetic data, the number of studies depicting phylogenies has exploded (Pagel, 1999). The number of articles publishing phylogenies based on gene-sequence information has been increasing exponentially; Figure 2 shows the cumulative number of such publications (Pagel, 1999). The phylogenies cover taxonomic groups ranging from viruses to bacteria, fungi, plants, and animals (Campbell & Reece, 2008). Thus, the phylogenetic tree has become popular and important for the evolutionary analysis of organisms. The phylogenetic tree is a branching diagram that shows the evolutionary relationships of organisms (Baum, 2008). Based on Darwin (1859), evolution refers to a natural process that concerns populations. It can be described as the change in the hereditary traits of biological populations over successive generations.

On the other hand, a phylogeny can show the similarities and differences in physical and hereditary traits. This is because taxa that are joined together in the tree are implied to have descended from a common ancestor at a node (Gregory, 2008). Thus, a phylogenetic tree can be regarded as similar to a family tree. Moreover, phylogenetic trees are constructed from the similarities or differences in the physical or genetic features of organisms. Until a few years ago, scientists constructed phylogenetic trees in the traditional way, which focused only on physical features. Fortunately, advances in technology have led to the accumulation of huge amounts of biological data (Wan & Che, 2013). This may change the way biological studies are carried out in various respects.

As mentioned by Wan and Che (2013), phylogenetic trees can be built from information on interacting pathways. They applied hierarchical clustering to two domains of organisms, eukaryotes and prokaryotes. Using interacting pathways can increase the effectiveness of revealing the evolutionary relationships of species (Wan & Che, 2013). A phylogenetic tree is constructed from a variety of evidence, most generally by comparing DNA (Kaizhong, Jason T., & Dennis, 1996). It is an undirected, acyclic, connected graph. Basically, the lengths of the branches represent the time since the groups split from each other, and the internal nodes of the tree represent ancestors. The exterior nodes are called leaves.
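The graph description above can be made concrete with a small sketch. The Python below is an illustration only and is not part of the original study: the function names (build_adjacency, is_tree, leaves) and the five-node example are hypothetical. It checks that a set of undirected edges is connected and acyclic, and lists the exterior nodes (leaves) as the nodes that have exactly one neighbour.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Map every node to the set of nodes it shares an undirected edge with."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def is_tree(edges):
    """True if the edges form a connected graph with no cycles."""
    adjacency = build_adjacency(edges)
    if not adjacency:
        return False
    # A connected graph on n nodes is acyclic exactly when it has n - 1 edges.
    if len(edges) != len(adjacency) - 1:
        return False
    start = next(iter(adjacency))
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return len(seen) == len(adjacency)


def leaves(edges):
    """Exterior nodes: nodes attached to the rest of the tree by one branch."""
    adjacency = build_adjacency(edges)
    return sorted(node for node, nbrs in adjacency.items() if len(nbrs) == 1)


# Hypothetical example: two ancestral nodes and three sampled taxa.
EXAMPLE_EDGES = [
    ("ancestor_1", "human"),
    ("ancestor_1", "chimp"),
    ("ancestor_2", "ancestor_1"),
    ("ancestor_2", "gorilla"),
]
```

In this example the two ancestor nodes are interior nodes, and the three named taxa are the leaves. Branch lengths are left out of the sketch; they could be stored as a weight on each edge.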


Apart from constructing phylogenetic trees, a newer approach extracts phylogenetic tree data from the literature. It uses content mining to extract the data from the literature (Mounce, 2012). The term content mining can be explained in two parts, content and mining. Content can include anything, such as audio, video, metadata, text, and images, while mining refers to the extraction of large amounts of information from that content. Extracting phylogenetic tree data from the literature relies on content mining rather than text mining alone, because the content is more than just text (Mounce, 2012).

In short, phylogenetic trees provide a framework that shows the evolution of features (Baum, 2008), which reflects the fact that related species share many similar features. Phylogenetic trees are also used in bio-prospecting, where an optimal strategy exploits phylogenetic information to target closely related species in the search for a shared feature of interest (Kelly, Grenyer, & Scotland, 2014). In other words, related species can be searched for features they have in common. Phylogenetic trees are therefore useful in conservation evaluation for choosing sets of species that maximize the present utilitarian benefits of extant feature diversity as well as the range of evolutionary trajectories in the future.

Problem Statement of the study

With the increasing volume of publication databases, the number of published phylogenetic trees is growing. This is because, with the rapid accumulation of DNA sequence data, more and more phylogenetic trees are being constructed (Pagel, 1999). As a result, searching for relevant information has become challenging and time consuming for researchers (Dereeper et al., 2008). In addition, the content of these published documents is varied and includes images, audio, art, and tables. Search engines rely on the text or captions that are often associated with a figure to perform a search, so a researcher has to classify the phylogenetic tree images one by one, which is challenging and a waste of time. Moreover, if biologists find it challenging and time consuming to search for a particular phylogenetic tree, their research work may be delayed. Furthermore, phylogenetic trees were invented to study the evolution of organisms, but nowadays published phylogenetic trees are mainly reused by other biologists. Therefore, an automated digitization application that searches for phylogenetic trees on their behalf is truly needed, because it can replace the very challenging manual task of determining whether an image is a phylogenetic tree.

Therefore, the main purpose of this project is the automated digitization of phylogenetic tree image classification using a machine learning algorithm. The classification focuses on deciding whether images in a PDF file or text file are phylogenetic trees or non-phylogenetic trees. Examples of phylogenetic trees are the cladogram, the phenogram, and tree terminology. On the other hand, examples of non-phylogenetic trees are the family tree, the life cycle of an organism, and the flow chart. Figure 3 shows an example of a non-phylogenetic tree, a family tree (Murdoch, 2013).

Figure 3. Non-phylogenetic tree: family tree. Adapted from Murdoch Family Tree, by Murdoch, W., 2013. Retrieved from http://www.williammurdoch.net/articles_12_Murdoch_family_tree.html. Copyright 2013 by Murdoch. Adapted with permission.

General Objective

The main objective of this research study is to employ a machine learning algorithm that can classify images as phylogenetic trees or non-phylogenetic trees.

Specific Objective

The specific objectives of this study are:

i. To employ machine learning that can predict whether a phylogenetic tree is represented in an image.

ii. To compare and contrast the different features that represent a phylogenetic tree in an image.

Research Question

i. Can a neural network be used to predict phylogenetic tree images?

ii. What discriminative features can be used for classifier learning?
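As a rough illustration of the first research question, the sketch below trains a single sigmoid neuron, the simplest building block of a neural network, to separate two classes of images. It is not the classifier used in this study. The two features (a straight-line ratio and a closed-shape ratio) and all of the numbers are invented purely for illustration, on the assumption that tree diagrams contain mostly straight branches while flow charts and similar figures contain more closed boxes.

```python
import math


def sigmoid(z):
    """Logistic activation used by the single neuron."""
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, epochs=2000, learning_rate=0.5):
    """Fit the weights and bias with plain stochastic gradient descent."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            activation = sum(w * x for w, x in zip(weights, features)) + bias
            # Gradient of the log loss with respect to the activation.
            error = sigmoid(activation) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


def predict(weights, bias, features):
    """Return 1 for 'phylogenetic tree' and 0 for 'non-phylogenetic tree'."""
    activation = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if sigmoid(activation) >= 0.5 else 0


# Hypothetical feature vectors: [straight_line_ratio, closed_shape_ratio].
TREE_IMAGES = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.05]]
OTHER_IMAGES = [[0.4, 0.7], [0.3, 0.9], [0.5, 0.6]]
SAMPLES = TREE_IMAGES + OTHER_IMAGES
LABELS = [1] * len(TREE_IMAGES) + [0] * len(OTHER_IMAGES)

WEIGHTS, BIAS = train(SAMPLES, LABELS)
```

Calling predict(WEIGHTS, BIAS, features) on a new feature vector returns the predicted class. The second research question corresponds to the choice of the feature vector itself: a real system would have to extract discriminative features from the image pixels before any such classifier could be trained.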

i. Phylogenetic tree, also called a phylogeny, is a branching diagram that illustrates the lines of evolutionary relationship of different kinds of species, organisms, or genes from a common ancestor (Baum, 2008).

ii. Phylogeny is the evolutionary relationship between organisms (Baum, 2008).

iii. Evolution analysis is the foundation of phylogenetic trees, with cautionary notes (Brinkman, 2005).

iv. Content mining is defined as a significant part of figure mining, which deals with non-textual content (Mounce, 2014).

This research study hopes to advance knowledge on the automated digitization of images from PDF files or text files and their classification as phylogenetic trees or non-phylogenetic trees.

This research study mainly focuses on the rooted tree (cladogram) and the unrooted tree.

In conclusion, phylogenetics is the science of constructing hypotheses about the evolutionary relationships of organisms in the form of phylogenetic trees. This project is not concerned with the reconstruction of phylogenetic trees. Rather, it classifies images in PDF files or text files as phylogenetic trees or non-phylogenetic trees using a machine learning algorithm.

CHAPTER TWO

LITERATURE REVIEW

As mentioned by Mounce (2012), millions of papers about phylogenetic trees are now published each year, at an ever-growing rate. This is because the amount and diversity of species with at least partial sequence information is rapidly increasing. Thus, phylogenetic trees have become an integral part of various biological studies, driven by the exponential increase in sequence data generated by classical and next-generation sequencing studies (Baum, 2008). This chapter is divided into a few sections. The first section focuses on phylogenetic trees and explains their meaning, purpose, and types. The next section concentrates on image features and emphasizes the features that are suitable for the image classification process. This section also reviews image recognition system frameworks.

Phylogenetic Tree

A phylogenetic tree, or evolutionary tree, is an illustrative representation of biological entities that are associated through common descent, such as species or higher-level taxonomic groupings (Gregory, 2008). The phylogenetic tree is a backbone for various other biological studies (Baum, 2008). It is therefore a graphical representation of a hypothesis about the evolution of a species, with branches that separate, hybridize, or terminate through extinction. Readers can read and understand the patterns of descent from phylogenetic trees. However, phylogenetic trees do not indicate when species evolved or how much genetic change occurred in a lineage (Kelly, Grenyer, & Scotland, 2014). For this reason, it should not be assumed from a phylogenetic tree that a taxon evolved from the taxon next to it.

Baum, Smith, and Donovan (2005) stated that the phylogenetic tree is the most direct illustration of the principle of common ancestry, which is why the phylogenetic tree is crucial for evolutionary theory. In fact, they were telling readers that a practical understanding of what a phylogenetic tree represents is very important for understanding the evolutionary relationships of species. Thus, phylogenetic trees have become important in the evolution analysis of any species, and biologists should increase their use of phylogenetic trees in the biological sciences. Phylogenetic trees also provide an efficient structure for organizing biodiversity information. Moreover, they develop an accurate conception of the totality of evolutionary history. It is therefore important for aspiring biologists to develop an understanding of phylogenetic trees.

Types of Phylogenetic Tree

Phylogenetic trees can be divided into different kinds. There are two main categories, rooted phylogenetic trees and unrooted phylogenetic trees. Apart from these two main categories, a phylogenetic tree can be drawn in several forms: the slanted cladogram in Figure 4 (Phylogenetic tree, 2002), the rectangular cladogram in Figure 5 (Phylogenetic tree, 2002), and the circular cladogram in Figure 6 (Phylogenetic tree, 2002). A root can be added artificially to an unrooted tree by means of a species that unambiguously separated early from the species being considered (Bacardit, 2009).
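The idea of rooting an unrooted tree with an early-diverging species (often called an outgroup) can be sketched in a few lines. The code below is a hypothetical illustration, not a method taken from this study: the function name root_with_outgroup, the edge-list input, and the nested-tuple output are all assumptions made for the example. It places the root on the branch that joins the outgroup to the rest of the tree and then orients every other branch away from that root.

```python
def root_with_outgroup(edges, outgroup):
    """Root an unrooted tree on the branch leading to the outgroup leaf.

    The unrooted tree is given as undirected edges. The result is a nested
    tuple in which each tuple is an ancestor and each string is a leaf.
    """
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    neighbours = adjacency[outgroup]
    if len(neighbours) != 1:
        raise ValueError("the outgroup must be a leaf of the tree")
    (attachment,) = neighbours

    def subtree(node, parent):
        # Every neighbour except the one we arrived from becomes a child.
        children = sorted(n for n in adjacency[node] if n != parent)
        if not children:
            return node
        return tuple(subtree(child, node) for child in children)

    # The new root has two children: the outgroup and everything else.
    return (outgroup, subtree(attachment, outgroup))


# Hypothetical unrooted tree with four taxa (A to D) and two interior nodes.
UNROOTED_EDGES = [("x", "A"), ("x", "B"), ("x", "y"), ("y", "C"), ("y", "D")]
ROOTED = root_with_outgroup(UNROOTED_EDGES, "D")
```

With D as the outgroup, ROOTED is ("D", ("C", ("A", "B"))): D splits off first, then C, and A and B are each other's closest relatives.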

12

Page 14: Faculty of Cognitive Sciences and Human Development Tree Classification... · Figure 4: phylogenetic rooted-tree: rectangular cladogram ..... 13 Figure 5: phylogenetic rooted-tree:

(2008) the analogous sequence is used to identify the similar sequence whereas the diverse

calibration is used to determine the difference of alignment Besides the phylogenetic rebuilding

is the process to build up the phylogenetic tree after the analogous seqence and the diverse

calibration process and then for the graphic representation or figure signification is used to show

the relationship between each species in the phylogenetic tree (Dereeper et aI 2008) This can

show that the increasing use of phylogenetic trees in biological sciences especially for biologists

who did the evolution analysis on the species Therefore the use of phylogentic tree is quite

important for the evolution analysis of life on Earth

Apart from that phylogeny is the evolutionary history of a species or group of related

species (Pagel 1999) The phylogeny can be called as the discipline of systematic classifies

organisms (Siegel-Causey Brooks amp Funk 1991) This is because phylogeny can be used to

determine organisms evolutionary relationship by systematist According to Campbell and Reece

(2008) the term systematist in this research refers to the professional who used fossil molecular

and genetic data to infer evolutionary relationships They also proposed PhyloCode which can

be used to depict the phylogenetic analysis in branching phylogenetic trees A phylogenetic

analysis presents as a collection of nodes and branch For instance the taxa that closely related

are in an evolutionary sense apppeared closely to each other whereas the taxa that distantly

related are in the different branches of the tree or there is a distance which is far from each other

in such tree

Background of the study

In the year of 1859 Darwin invented the first illustration of a phylogenetic tree (Darwin

1859) Before that shortly after his famous five years voyage as naturalist on Beagle in the year

2

2000

1000

of 1837 he sketched a tree diagram in his notebook (Darwin 1859) Based on the Figure I the

simple sketch was remarkably similar to modem diagrams of phylogenies (Darwin 1987)

9L-shy ~ ~ A 2$ ~laquo

~ r amp4 ~ lt- C ~ 7S _ ~ ~r p--~ -$ - 2gt

-z-a ~ ltZ- ~~-

~L-- F bull - L~ -~---r~ - - ~-------r rd 4=shy

Figure 1 The first evolution tree diagram sketched by Darwin Adapted from Charles Darwins

notebooks 1836-1844 Geology transmutation ospecies metaphysical enquiries (p 87) by Druwin c 1987 Cambridge Cambridge University Press Copyright 1987 by the P H Barrett (Ed) Adapted with pennission

o-l-lr=It=I-=-~=-lJ -------_ 1980 1985 1990 1995 2000

Year Figure 2 Cumulative number of publications in Science Citation index since 1981 that cite the tenn molecular and phylogeny in the keywords or abstract Adapted from Inferring the historical patterns of biological evolution by Pagel M 1999 Nature 401(6756) p 844 Copyright 1999 by the Pagel Adapted with pennission

3

First illustration of a phylogenetic tree is the first scientific argument for the theory of

advancement by means of innate selection Darwin (1998) stated that The time will come I

believe though I shall not live to see it when we shall have fairly true genealogical trees ofeach

great kingdom ofNature (p 18) In fact he mentioned that he would have the willingness to see

how modem genetics supported and confirmed by his owns ideas He provided evidence which

is not only for what had happened in the aspect of evolution but precisely how living things

evolve The forensic evidence he used for evolution was the DNA (Darwin 1859)

In fact there are few approaches used for discovering the evolution analysis of species

before the molecular phylogenetic (Campbell amp Reece 2008) In the year of 1990s

immunochemical studies were used to discover cross-reactions that stronger for closely related

organism Next in between the year of 1940s until 1960s biologists used the protein sequencing

method electrophoresis DNA hybridization and PCR that contributed to a boom in molecular

phylogeny On the other hand after publication of The Origin ofSpecies by Darwin many other

biologists came and accepted the truth of a universal Tree of Life (Darwin 1987) Then in the

late of 1970s biologists started to discover evolutionary analysis of organisms by using

molecular phylogeny One of the examples of experts from German biologists who supported

Darwins Tree of Life was the Ernst Haeckel (Larget 2011) It is very useful of using

phylogenetic trees for biologists because they can use them to describe the relations between

living creatures genomes atd genes

With the development of phylogenetic data technique there are the numbers of studies

depicting phylogenetic exploded (Pagel 1999) The number of articles publishing phylogenies

based on gene-sequence information has been increasing exponentially Figure 2 shows the data

aoalysis by using the phylogenetic tree (Pagel 1999) The phylogenies taxonomic group ranging

4

Pu~at Khidmat MaklulDlt Akademillt UN1VERSm MALAYSIA SUAWA)(

from viruses to bacteria fungi plans and animals (Campbell amp Reece 2008) Thus the

phylogenetic tree becomes popular and important for the evolutionary analysis of organisms

nowadays The phylogenetic tree is a branching diagram that shows the evolutionary relationship

of the organisms (Baum D 2008) Based on Darwin (1859) evolution refers to a natural

procedure to infer about the populations It can be described as the platfonn to show the

transformation in the hereditary traits of biological population over continuous generation

On the other hand phylogeny can show the similarities and differences in physical and

hereditary traits This is because there are the taxa that can attach together in the affinnation

which indicated to posse descendant from a node (Gregory 2008) Thus phylogenetic tree can

be concluded that it was similar to a family tree Moreover the construction of phylogenetic

trees is based on the similarities or differences of their physical or genetic features Few years

ago the scientists only used the tradition way which only focused on physical features of

constructing phylogenetic trees Luckily the advancement of high technologies has been led to

accumulation of huge amounts of biological data (Wan amp Che 2013) This may lead to the

changing towards the way of biological studies in various aspects

As mentioned by Wan and Che (2013) building phylogenetic trees can use the

information of interacting pathways They did apply the hierarchical clustering on two domains

of organisms which were eukaryotes and prokaryotes Using interacting pathway can increase

the effectiveness on revearing evolutionary relationships ofthe species (Wan amp Che 2013)

Phylogenetic tree was constructed using variety evidence such as generally comparing DNA

(Kaizhong Jason T amp Dennis 1996) It was an undirected acyclic connected graph Basically

the lengths of branches represented time since the groups split from each other and the node for

he tree is known as ancestors The set of exterior nodes are called leaves

5

Apart from constructing the phylogenetic tree the new approach nowadays can extract

the phylogenetic tree data from the literacture review In fact it is using the content mining to

extract the data from the literature review (Mounce 2012) Content mining can be split into

content and mining in explanation Content can be included anything such as the audio video

metadata text and image Besides the mining shows the huge number of data information

extraction from the content Extracting phylogenetic tree data from literacture review uses more

content mining than text mining because the content was more than just text (Mounce 2012)

In short, phylogenetic trees provide a framework that shows the evolution of features (Baum, D., 2008). This shows that related species share many similar features. Phylogenetic trees are also used in bio-prospecting, an optimal strategy that exploits phylogenetic information to target closely related species in the search for a shared feature of interest (Kelly, Grenyer, & Scotland, 2014). This shows that related species can be searched for the features they have in common. Therefore, phylogenetic trees are useful for conservation evaluation, in choosing sets of species that maximize the present utilitarian benefits of extant feature diversity as well as the range of evolutionary trajectories in the future.

Problem Statement of the study

With the increasing volume of publication databases, the number of published phylogenetic trees is growing. This is because, with the rapid accumulation of DNA sequence data, more and more phylogenetic trees are being constructed (Pagel, 1999). As a result, it is challenging and time-consuming for a researcher to search for relevant information (Dereeper et al., 2008). In addition, the content of these published documents is varied and includes images, audio, art, and tables. Search engines rely on the text or captions associated with a figure to perform a search. This makes classifying phylogenetic tree images one by one challenging and time-wasting for the researcher. Moreover, if biologists find it challenging and time-consuming to search for a particular phylogenetic tree, their research work may be delayed. Furthermore, phylogenetic trees were devised to study the evolution of organisms. Nowadays, published phylogenetic trees are mainly reused by other biologists. Therefore, an automated digitization application that searches for phylogenetic trees on their behalf is truly needed, because it can take over this very challenging manual task and determine whether an image is a phylogenetic tree.

Therefore, the main purpose of this project is to automate the classification of phylogenetic tree images by using a machine learning algorithm. The classification focuses on whether images in a PDF file or text file are phylogenetic trees or non-phylogenetic trees. Examples of phylogenetic trees are the cladogram, the phenogram, and tree terminology. On the other hand, examples of non-phylogenetic trees are the family tree, the life cycle of organisms, and the flow chart. Figure 3 shows an example of a non-phylogenetic tree, a family tree (Murdoch, 2013).

Figure 3. Non-phylogenetic tree: family tree. Adapted from "Murdoch Family Tree," by W. Murdoch, 2013. Retrieved from http://www.williammurdoch.net/articles_12_Murdoch_family_tree.html. Copyright 2013 by Murdoch. Adapted with permission.

General Objective: The main objective of this research study is to employ a machine learning algorithm that can classify images as phylogenetic trees or non-phylogenetic trees.

Specific Objective: The specific objectives of this study are:

i. To employ machine learning to predict whether an image represents a phylogenetic tree.

ii. To compare and contrast the different features that represent a phylogenetic tree in an image.

Research Question

i. Can a neural network be used to predict phylogenetic tree images?

ii. What discriminative features can be used for classifier learning?
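To make the two research questions more tangible, the sketch below trains a single-neuron perceptron, the simplest neural unit, on two image features. It is a toy illustration only and is not the method or the data of this study: the feature names (straight_line_ratio, text_area_ratio), the sample values, and the function names are all hypothetical and hand-made for the example.

```python
def predict(weights, bias, features):
    """Return 1 (phylogenetic tree) if the weighted sum is positive, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Train a single-neuron perceptron on (features, label) pairs."""
    n_features = len(samples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        for features, label in samples:
            error = label - predict(weights, bias, features)
            if error != 0:
                # Move the decision boundary towards the misclassified sample.
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * error
    return weights, bias


# Hypothetical, hand-made feature vectors: (straight_line_ratio, text_area_ratio).
# Label 1 = phylogenetic tree image, label 0 = non-phylogenetic tree image.
TOY_SAMPLES = [
    ((0.9, 0.1), 1),
    ((0.8, 0.2), 1),
    ((0.7, 0.1), 1),
    ((0.2, 0.8), 0),
    ((0.3, 0.9), 0),
    ((0.1, 0.6), 0),
]

if __name__ == "__main__":
    weights, bias = train_perceptron(TOY_SAMPLES)
    print("weights:", weights, "bias:", bias)
    print("new image (0.85, 0.15):", predict(weights, bias, (0.85, 0.15)))
```

The learned weights give a rough indication of how discriminative each feature is on this toy data; real images would need features extracted from the pixels and a held-out test set.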

i. A phylogenetic tree, also called a phylogeny, is a branching diagram that illustrates the lines of evolutionary relationships of different kinds of species, organisms, or genes from a common ancestor (Baum, D., 2008).

ii. Phylogeny is the evolutionary relationship between organisms (Baum, D., 2008).

iii. Evolution analysis is the fundamental basis of phylogenetic trees, with cautionary notes (Brinkman, 2005).

iv. Content mining is defined as a significant part of figure mining, which deals with non-textual content (Mounce, 2014).


This research study hopes to advance knowledge on the automated digitization of phylogenetic tree images, classifying images from PDF or text files as phylogenetic trees or non-phylogenetic trees.

This research study focuses mainly on the rooted tree (cladogram) and the unrooted tree.

In conclusion, phylogenetics is the science of constructing hypotheses about the evolutionary relationships of organisms in the form of phylogenetic trees. This project is not concerned with the reconstruction of phylogenetic trees. Rather, it classifies images in PDF or text files as phylogenetic trees or non-phylogenetic trees by using a machine learning algorithm.


CHAPTER TWO

LITERATURE REVIEW

As mentioned by Mounce (2012), millions of papers are now published each year, at an ever-growing rate, about phylogenetic trees. This is because the amount and [illegible] of species with at least partial sequence information is rapidly increasing. Thus, phylogenetic trees have become an integral part of various biological studies with the exponential increase of sequence data generated by various classical and next-generation sequencing studies (Baum, D., 2008). This chapter is divided into a few sections. The first section focuses on phylogenetic trees and explains their meaning and purpose, as well as the types of phylogenetic trees. The next section concentrates on the features of images. This section also emphasizes the features that are suitable for the image classification process. In addition, this section reviews image recognition system frameworks as well.

Phylogenetic Tree

A phylogenetic tree, or evolutionary tree, is an illustrative representation of biological entities that are associated by common descent, such as species or higher-level taxonomic groupings (Gregory, 2008). The phylogenetic tree represents a backbone for various other biological studies (Baum, 2008). Therefore, it is a graphical representation of a hypothesis about the evolution of a species, with branches that separated, hybridized, or were terminated by extinction. Readers can read and understand the patterns of descent from phylogenetic trees. However, phylogenetic trees do not indicate when species evolved or how much genetic change occurred in a lineage (Kelly, Grenyer, & Scotland, 2014). This is because it should not be assumed from a phylogenetic tree that a taxon evolved from the taxon next to it.

Baum, Smith, and Donovan (2005) stated that the phylogenetic tree is the most direct representation of the principle of common ancestry. This is because the phylogenetic tree is crucial for evolutionary theory. In fact, they were telling readers that a practical understanding of what a phylogenetic tree represents is important for understanding the evolutionary relationships of species. Thus, phylogenetic trees have become important in the evolutionary analysis of any species, and biologists should increase their use of phylogenetic trees in the biological sciences. Next, phylogenetic trees provide an efficient structure for organizing biodiversity information. Moreover, they develop an accurate conception of the totality of evolutionary history. Therefore, it is important for aspiring biologists to develop an understanding of phylogenetic trees.

Types of Phylogenetic Tree

Phylogenetic trees can be divided into different kinds. There are two main categories: rooted phylogenetic trees and unrooted phylogenetic trees. Apart from the two main categories, a phylogenetic tree can be represented in several forms: the slanted cladogram, Figure 4 (Phylogenetic tree, 2002); the rectangular cladogram, Figure 5 (Phylogenetic tree, 2002); and the circular cladogram, Figure 6 (Phylogenetic tree, 2002). Roots can be artificially added to unrooted trees by means of a species that unambiguously separated early from the other species being considered (Bacardit, 2009).
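Rooting an unrooted tree with an early-diverging species (often called an outgroup) can be sketched as follows. This is a minimal illustration under stated assumptions and does not come from the cited sources: the tree is given as a dictionary that maps each node to the set of its neighbours, the node names are invented, and the function name root_with_outgroup is hypothetical. The result is a nested tuple in which the outgroup sits on one side of the root and all other species on the other side.

```python
def root_with_outgroup(adjacency, outgroup):
    """Root an unrooted tree on the branch that leads to the outgroup leaf.

    adjacency maps each node name to the set of its neighbours.
    Returns nested tuples; leaves are plain strings.
    """
    if len(adjacency[outgroup]) != 1:
        raise ValueError("the outgroup must be a leaf with exactly one neighbour")
    (attachment,) = adjacency[outgroup]  # the single node the outgroup hangs from

    def subtree(node, parent):
        # Walk away from the parent so that every branch points away from the root.
        children = sorted(n for n in adjacency[node] if n != parent)
        if not children:
            return node
        return tuple(subtree(child, node) for child in children)

    return (outgroup, subtree(attachment, outgroup))


# Hypothetical unrooted tree: leaves A, B, C, O and internal nodes x, y.
UNROOTED = {
    "A": {"x"},
    "B": {"x"},
    "C": {"y"},
    "O": {"y"},
    "x": {"A", "B", "y"},
    "y": {"x", "C", "O"},
}

if __name__ == "__main__":
    print(root_with_outgroup(UNROOTED, "O"))
```

Choosing a different outgroup leaf places the root on a different branch, which is why the outgroup should be a species known to have separated before the others.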


of 1837 he sketched a tree diagram in his notebook (Darwin, 1859). As shown in Figure 1, the simple sketch was remarkably similar to modern diagrams of phylogenies (Darwin, 1987).

Figure 1. The first evolution tree diagram sketched by Darwin. Adapted from Charles Darwin's notebooks, 1836-1844: Geology, transmutation of species, metaphysical enquiries (p. 87), by C. Darwin, 1987, Cambridge: Cambridge University Press. Copyright 1987 by P. H. Barrett (Ed.). Adapted with permission.

Figure 2. Cumulative number of publications in the Science Citation Index since 1981 that cite the terms "molecular" and "phylogeny" in the keywords or abstract (horizontal axis: year, 1980 to 2000). Adapted from "Inferring the historical patterns of biological evolution," by M. Pagel, 1999, Nature, 401(6756), p. 844. Copyright 1999 by Pagel. Adapted with permission.

The first illustration of a phylogenetic tree served as the first scientific argument for the theory of evolution by means of natural selection. Darwin (1998) stated that "The time will come, I believe, though I shall not live to see it, when we shall have fairly true genealogical trees of each great kingdom of Nature" (p. 18). In fact, he would have wished to see how modern genetics supports and confirms his own ideas. Modern genetics provides evidence not only of what happened in the course of evolution but of precisely how living things evolve. The forensic evidence for evolution today is DNA, which was not yet known to Darwin (1859).

In fact, a few approaches were used to study the evolution of species before molecular phylogenetics (Campbell & Reece, 2008). In the 1900s, immunochemical studies were used to discover cross-reactions, which are stronger for closely related organisms. Next, between the 1940s and the 1960s, biologists used protein sequencing, electrophoresis, DNA hybridization, and PCR, which contributed to a boom in molecular phylogeny. On the other hand, after the publication of The Origin of Species by Darwin, many other biologists came to accept the truth of a universal Tree of Life (Darwin, 1987). Then, in the late 1970s, biologists started to analyse the evolution of organisms by using molecular phylogeny. One example of an expert who supported Darwin's Tree of Life was the German biologist Ernst Haeckel (Larget, 2011). Phylogenetic trees are very useful for biologists because they can be used to describe the relations between living creatures, genomes, and genes.

With the development of phylogenetic data techniques, the number of studies depicting phylogenies has exploded (Pagel, 1999). The number of articles publishing phylogenies based on gene-sequence information has been increasing exponentially. Figure 2 shows an analysis of these publication data (Pagel, 1999). The phylogenies cover taxonomic groups ranging from viruses to bacteria, fungi, plants, and animals (Campbell & Reece, 2008). Thus, the phylogenetic tree has become popular and important for the evolutionary analysis of organisms nowadays. The phylogenetic tree is a branching diagram that shows the evolutionary relationships of organisms (Baum, D., 2008). According to Darwin (1859), evolution refers to a natural process in populations. It can be described as the transformation of the hereditary traits of biological populations over successive generations.

On the other hand, a phylogeny can show the similarities and differences in physical and hereditary traits. This is because taxa that are attached together in the tree share an affiliation

which indicates that they descend from a common node (Gregory, 2008).

Page 16: Faculty of Cognitive Sciences and Human Development Tree Classification... · Figure 4: phylogenetic rooted-tree: rectangular cladogram ..... 13 Figure 5: phylogenetic rooted-tree:

First illustration of a phylogenetic tree is the first scientific argument for the theory of

advancement by means of innate selection Darwin (1998) stated that The time will come I

believe though I shall not live to see it when we shall have fairly true genealogical trees ofeach

great kingdom ofNature (p 18) In fact he mentioned that he would have the willingness to see

how modem genetics supported and confirmed by his owns ideas He provided evidence which

is not only for what had happened in the aspect of evolution but precisely how living things

evolve The forensic evidence he used for evolution was the DNA (Darwin 1859)

In fact there are few approaches used for discovering the evolution analysis of species

before the molecular phylogenetic (Campbell amp Reece 2008) In the year of 1990s

immunochemical studies were used to discover cross-reactions that stronger for closely related

organism Next in between the year of 1940s until 1960s biologists used the protein sequencing

method electrophoresis DNA hybridization and PCR that contributed to a boom in molecular

phylogeny On the other hand after publication of The Origin ofSpecies by Darwin many other

biologists came and accepted the truth of a universal Tree of Life (Darwin 1987) Then in the

late of 1970s biologists started to discover evolutionary analysis of organisms by using

molecular phylogeny One of the examples of experts from German biologists who supported

Darwins Tree of Life was the Ernst Haeckel (Larget 2011) It is very useful of using

phylogenetic trees for biologists because they can use them to describe the relations between

living creatures genomes atd genes

With the development of phylogenetic data technique there are the numbers of studies

depicting phylogenetic exploded (Pagel 1999) The number of articles publishing phylogenies

based on gene-sequence information has been increasing exponentially Figure 2 shows the data

aoalysis by using the phylogenetic tree (Pagel 1999) The phylogenies taxonomic group ranging

4

Pu~at Khidmat MaklulDlt Akademillt UN1VERSm MALAYSIA SUAWA)(

from viruses to bacteria fungi plans and animals (Campbell amp Reece 2008) Thus the

phylogenetic tree becomes popular and important for the evolutionary analysis of organisms

nowadays The phylogenetic tree is a branching diagram that shows the evolutionary relationship

of the organisms (Baum D 2008) Based on Darwin (1859) evolution refers to a natural

procedure to infer about the populations It can be described as the platfonn to show the

transformation in the hereditary traits of biological population over continuous generation

On the other hand phylogeny can show the similarities and differences in physical and

hereditary traits This is because there are the taxa that can attach together in the affinnation

which indicated to posse descendant from a node (Gregory 2008) Thus phylogenetic tree can

be concluded that it was similar to a family tree Moreover the construction of phylogenetic

trees is based on the similarities or differences of their physical or genetic features Few years

ago the scientists only used the tradition way which only focused on physical features of

constructing phylogenetic trees Luckily the advancement of high technologies has been led to

accumulation of huge amounts of biological data (Wan amp Che 2013) This may lead to the

changing towards the way of biological studies in various aspects

As mentioned by Wan and Che (2013) building phylogenetic trees can use the

information of interacting pathways They did apply the hierarchical clustering on two domains

of organisms which were eukaryotes and prokaryotes Using interacting pathway can increase

the effectiveness on revearing evolutionary relationships ofthe species (Wan amp Che 2013)

Phylogenetic tree was constructed using variety evidence such as generally comparing DNA

(Kaizhong Jason T amp Dennis 1996) It was an undirected acyclic connected graph Basically

the lengths of branches represented time since the groups split from each other and the node for

he tree is known as ancestors The set of exterior nodes are called leaves

5

Apart from constructing the phylogenetic tree the new approach nowadays can extract

the phylogenetic tree data from the literacture review In fact it is using the content mining to

extract the data from the literature review (Mounce 2012) Content mining can be split into

content and mining in explanation Content can be included anything such as the audio video

metadata text and image Besides the mining shows the huge number of data information

extraction from the content Extracting phylogenetic tree data from literacture review uses more

content mining than text mining because the content was more than just text (Mounce 2012)

In short phylogenetic trees provides a framework that shows the evolution of features

(Baum D 2008) This shows that the related species shared in many common of similar

features Next the phylogenetic trees also uses in bio-prospecting which is an optimal strategy

that exploited phylogenetic information to target closely related species to search for shared

feature of interest (Kelly Grenyer amp Scotland 2014) This shows that related species can search

for shared features in common Therefore the phylogenetic trees are useful for conservation

evaluation in choosing sets of species that can maximized the present utilitarian benefits of

extant feature diversity as well as the range of evolutionary trajectories in the future

Problem Statement of the study

With the increase volume of publication databases volume of the phylogenetic trees is

getting bigger It is because with the rapid accumulation of DNA sequence data more and more ~

phylogenetic trees are being constructed (Pagel 1999) It is technically leads to challenge and

time consuming for a researcher to search for relevant information (Dereeper et aI 2008) Next

the types of contents in these published documents are various such as images audio arts and

tables Search engines rely on texts or captions are often associated with a figure to perform a

search This makes the classification of the phylogenetic trees image one by one by the

6

researcher becoming challenging and waste of time Moreover if the biologist becomes

challenging and time consuming when searching for the particular phylogenetic tree this may

delay their research works Furtermore the purpose for the invented phylogentic trees is to study

the evolution analysis of the organisms Nowadays the presented phylogenetic tree mainly is

used to reuse purpose for those biologists Therefore the use of automated digization application

to search the phylogenetic trees for them is truthly needed It is because this can replace the very

challenging task of human works and determine whether an image is a phylogenetic tree

Therefore the main purpose of conducting this project is to do the automated digitation

of phylogenetic tree image classification by using machine learning algorithm This classification

is mainly focusing on the classification the images in pdf file or text file whether they are

phylogenetic tree or non-phylogenetic trees The examples of phylogenetic tree are cladogram

phenogram and tree terminology On the other hands the examples of non-phylogenetic trees are

the family tree life cycle of organisms and flow chart Figure 3 shows the pictures of non-

phylogenetic trees- family tree (Murdoch 2013)

[Figure: Murdoch family tree diagram]

Figure 1. Non-phylogenetic tree: family tree. Adapted from Murdoch Family Tree, by W. Murdoch, 2013. Retrieved from http://www.williammurdoch.net/articles_12_Murdoch_family_tree.html. Copyright 2013 by Murdoch. Adapted with permission.

General Objective

The main objective of this research study is to employ a machine learning algorithm that can classify images as phylogenetic trees or non-phylogenetic trees.

Specific Objective

The specific objectives of this study are:

i. To employ machine learning that can predict whether a phylogenetic tree is represented in an image.

ii. To compare and contrast the different features that represent a phylogenetic tree in an image.

Research Question

i. Can a neural network be used to predict phylogenetic tree images?

ii. What discriminative features can be used for classifier learning?

i. A phylogenetic tree, also called a phylogeny, is a branching diagram that illustrates the lines of evolutionary relationship of different species, organisms or genes from a common ancestor (Baum D 2008).

ii. Phylogeny is the evolutionary relationship between organisms (Baum D 2008).

iii. Evolution analysis is the fundamental basis of phylogenetic trees, with cautionary notes (Brinkman 2005).

iv. Content mining is defined as a significant part of figure mining, which deals with non-textual content (Mounce 2014).

This research study hopes to advance knowledge on the automated classification of images from pdf or text files as phylogenetic trees or non-phylogenetic trees.

This research study is mainly focused on the rooted tree (cladogram) and the unrooted tree.

In conclusion, phylogenetics is the science of constructing hypotheses about the evolutionary relationships of organisms in the form of a phylogenetic tree. This project is not concerned with the reconstruction of phylogenetic trees. Rather, it classifies images in pdf or text files as phylogenetic trees or non-phylogenetic trees by using a machine learning algorithm.
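The two-class decision described here can be illustrated with a minimal sketch. It is not the method used in this project: it trains a single logistic neuron, the simplest stand-in for a neural network, on two hypothetical image features (`line_fraction` and `text_fraction`). The feature names and all feature values are invented purely to exercise the code.

```python
import math


def sigmoid(z):
    """Logistic activation of a single neuron."""
    return 1.0 / (1.0 + math.exp(-z))


def activation(weights, bias, features):
    """Neuron output in (0, 1) for one feature vector."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def train(samples, labels, epochs=2000, rate=0.5):
    """Fit one logistic neuron by stochastic gradient descent."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            error = activation(weights, bias, features) - label
            weights = [w - rate * error * x for w, x in zip(weights, features)]
            bias -= rate * error
    return weights, bias


def predict(weights, bias, features):
    """Return 1 (phylogenetic tree) or 0 (non-phylogenetic tree)."""
    return 1 if activation(weights, bias, features) >= 0.5 else 0


# Hypothetical feature vectors: [line_fraction, text_fraction].
# The values are made up for illustration and are not measured data.
SAMPLES = [
    [0.90, 0.20], [0.80, 0.10], [0.85, 0.30],  # labelled as phylogenetic trees
    [0.20, 0.70], [0.30, 0.90], [0.10, 0.60],  # labelled as non-phylogenetic
]
LABELS = [1, 1, 1, 0, 0, 0]

WEIGHTS, BIAS = train(SAMPLES, LABELS)
PREDICTIONS = [predict(WEIGHTS, BIAS, s) for s in SAMPLES]
```

In a real system the feature vectors would come from an image feature extraction step, and a held-out test set would be needed to judge the classifier.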


CHAPTER TWO

LITERATURE REVIEW

As mentioned by Mounce (2012), millions of papers about phylogenetic trees are now published each year at an ever-growing rate. This is because the amount and [...] of species with at least partial sequence information is rapidly increasing. Thus, phylogenetic trees have become an integral part of various biological studies with the exponential increase of sequence data, which is being generated by various classical and next generation sequencing studies (Baum D 2008). This chapter is divided into a few sections. The first section focuses on phylogenetic trees, explaining the meaning and purpose of phylogenetic trees and the types of phylogenetic trees. The next section concentrates on image features and emphasizes the features that are suitable for the image classification process. Besides, this section reviews image recognition system frameworks as well.

Phylogenetic Tree

A phylogenetic tree, or evolutionary tree, is an illustrative representation of biological entities that are associated by common descent, such as species or higher-level taxonomic groupings (Gregory 2008). The phylogenetic tree represents a backbone for various other biological studies (Baum 2008). It is therefore a graphical representation of a hypothesis about the evolution of a species, with branches that separated, hybridized or terminated by extinction. Readers can read and understand the patterns of descent from phylogenetic trees. However, phylogenetic trees do not indicate when species evolved or how much genetic change occurred in a lineage (Kelly, Grenyer & Scotland 2014). Moreover, it should not be assumed from a phylogenetic tree that a taxon evolved from the taxon next to it.

Baum, Smith and Donovan (2005) stated that the phylogenetic tree is the most direct representation of the principle of common ancestry, which makes it crucial for evolutionary theory. In fact, they were telling readers that a practical understanding of what a phylogenetic tree represents is really important for understanding the evolutionary relationships of species. Thus, phylogenetic trees have become important in the evolutionary analysis of any species, and biologists should increase their use of phylogenetic trees in the biological sciences. Next, phylogenetic trees provide an efficient structure for organizing biodiversity information. Moreover, they develop an accurate conception of the totality of evolutionary history. Therefore, it is important for aspiring biologists to develop an understanding of phylogenetic trees.

Types of Phylogenetic Tree

Phylogenetic trees can be divided into different kinds of trees. There are two main categories, the rooted phylogenetic tree and the unrooted phylogenetic tree. Apart from the two main categories, a phylogenetic tree can be drawn in several forms: slanted cladogram, Figure 4 (Phylogenetic tree 2002), rectangular cladogram, Figure 5 (Phylogenetic tree 2002), and circular cladogram, Figure 6 (Phylogenetic tree 2002). Roots can be artificially added to unrooted trees by means of a species that unambiguously separated early from the species being considered (Bacardit 2009).
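Rooting an unrooted tree with an early-diverging species (an outgroup) can be sketched as follows. The edge-list representation, the function name `root_with_outgroup`, the label `root` and the example taxa are assumptions made for illustration; they are not taken from the cited sources.

```python
def root_with_outgroup(edges, outgroup):
    """Return {parent: [children]} for the tree rooted on the outgroup's branch."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    if outgroup not in adjacency or len(adjacency[outgroup]) != 1:
        raise ValueError("outgroup must be a leaf of the tree")
    (attachment,) = adjacency[outgroup]
    # Place a new root on the branch that joins the outgroup to the other taxa.
    children = {"root": [outgroup, attachment]}
    stack = [(attachment, outgroup)]
    while stack:
        node, parent = stack.pop()
        # Every neighbour except the one we arrived from becomes a child.
        kids = sorted(adjacency[node] - {parent})
        if kids:
            children[node] = kids
        stack.extend((kid, node) for kid in kids)
    return children


# Hypothetical unrooted tree: four taxa joined through internal nodes x and y.
UNROOTED = [("human", "x"), ("chimp", "x"), ("x", "y"), ("mouse", "y"), ("rat", "y")]
ROOTED = root_with_outgroup(UNROOTED, "rat")
```

Choosing a different outgroup changes only the direction in which the same branches are read, not the branches themselves.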


from viruses to bacteria, fungi, plants and animals (Campbell & Reece 2008). Thus, the phylogenetic tree has become popular and important for the evolutionary analysis of organisms nowadays. The phylogenetic tree is a branching diagram that shows the evolutionary relationships of organisms (Baum D 2008). Based on Darwin (1859), evolution refers to a natural process in populations. It can be described as the transformation in the hereditary traits of biological populations over successive generations.

On the other hand, a phylogeny can show the similarities and differences in physical and hereditary traits. This is because the taxa that are joined together in the tree are indicated to be descendants of a common node (Gregory 2008). Thus, a phylogenetic tree can be considered similar to a family tree. Moreover, the construction of phylogenetic trees is based on the similarities or differences of the organisms' physical or genetic features. A few years ago, scientists used only the traditional way of constructing phylogenetic trees, which focused on physical features. Fortunately, advances in technology have led to the accumulation of huge amounts of biological data (Wan & Che 2013). This may change the way biological studies are carried out in various aspects.

As mentioned by Wan and Che (2013), phylogenetic trees can be built using information on interacting pathways. They applied hierarchical clustering to two groups of organisms, eukaryotes and prokaryotes. Using interacting pathways can increase the effectiveness of revealing the evolutionary relationships of species (Wan & Che 2013). Phylogenetic trees are constructed using a variety of evidence, most commonly by comparing DNA (Kaizhong, Jason T & Dennis 1996). A phylogenetic tree is an undirected, acyclic, connected graph. Basically, the lengths of the branches represent the time since the groups split from each other, and the internal nodes of the tree are known as ancestors. The set of exterior nodes are called leaves.
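The description of a phylogenetic tree as an undirected, acyclic, connected graph whose exterior nodes are leaves can be checked mechanically. The following sketch assumes a hypothetical edge-list representation; the taxon names and the internal node labels `x` and `y` are invented for illustration.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Map each node to the set of its neighbours in an undirected graph."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def is_tree(edges):
    """True if the undirected graph is both connected and acyclic."""
    adjacency = build_adjacency(edges)
    nodes = list(adjacency)
    if not nodes:
        return False
    # A connected graph on n nodes is acyclic exactly when it has n - 1 edges.
    unique_edges = {frozenset(edge) for edge in edges}
    if len(unique_edges) != len(nodes) - 1:
        return False
    # Breadth-first search from any node must reach every other node.
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(nodes)


def leaves(edges):
    """Exterior nodes: those attached to the tree by a single branch."""
    adjacency = build_adjacency(edges)
    return sorted(node for node, nbrs in adjacency.items() if len(nbrs) == 1)


# Hypothetical unrooted tree: four taxa joined through internal nodes x and y.
EXAMPLE_EDGES = [("human", "x"), ("chimp", "x"), ("x", "y"), ("mouse", "y"), ("rat", "y")]
```

Here `x` and `y` play the role of ancestors, while the four named taxa are the leaves.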


Apart from constructing phylogenetic trees, a newer approach is to extract phylogenetic tree data from the published literature. This uses content mining to extract the data from the literature (Mounce 2012). The term content mining can be explained in two parts, content and mining. Content can include anything, such as audio, video, metadata, text and images. Mining refers to the extraction of large amounts of information from the content. Extracting phylogenetic tree data from the literature relies on content mining rather than text mining alone because the content is more than just text (Mounce 2012).

In short, phylogenetic trees provide a framework that shows the evolution of features (Baum D 2008). This shows that related species share many similar features. Next, phylogenetic trees are also used in bio-prospecting, where an optimal strategy exploits phylogenetic information to target closely related species in the search for a shared feature of interest (Kelly, Grenyer & Scotland 2014). This shows that related species can be searched for shared features. Therefore, phylogenetic trees are useful in conservation evaluation for choosing sets of species that maximize the present utilitarian benefits of extant feature diversity as well as the range of evolutionary trajectories in the future.



Page 19: Faculty of Cognitive Sciences and Human Development Tree Classification... · Figure 4: phylogenetic rooted-tree: rectangular cladogram ..... 13 Figure 5: phylogenetic rooted-tree:

researcher becoming challenging and waste of time Moreover if the biologist becomes

challenging and time consuming when searching for the particular phylogenetic tree this may

delay their research works Furtermore the purpose for the invented phylogentic trees is to study

the evolution analysis of the organisms Nowadays the presented phylogenetic tree mainly is

used to reuse purpose for those biologists Therefore the use of automated digization application

to search the phylogenetic trees for them is truthly needed It is because this can replace the very

challenging task of human works and determine whether an image is a phylogenetic tree

Therefore the main purpose of conducting this project is to do the automated digitation

of phylogenetic tree image classification by using machine learning algorithm This classification

is mainly focusing on the classification the images in pdf file or text file whether they are

phylogenetic tree or non-phylogenetic trees The examples of phylogenetic tree are cladogram

phenogram and tree terminology On the other hands the examples of non-phylogenetic trees are

the family tree life cycle of organisms and flow chart Figure 3 shows the pictures of non-

phylogenetic trees- family tree (Murdoch 2013)

7

Ebenezer Murdoch - CID - Riddick Grizel John Young shy CID shy Ann lowden 1761-1806 i 1761-1834 Mason II Shoemaker I I

Samuel lowden Murdoch -shyCID -- shy Jane Young 1784-1830 1787-1879 Shoemaker

John Muir Ebenezer Andrew McCulloch Jane John Coupland Margaret Murdoch Murdoch Murdoch Murdoch Murdoch Murdoch

1809 1810 -1864 1812 - l860 1816 - 1894 1818-1879 1820- Infant death C Uln Captai~ Shoemaker

James Murdoch shy CID shy Agnes Cumming

Mary Murdoch

1841-1929

1814 - 1900 ClJplaln

Jane Murdoch

1848-1924

Jeanie Muirhead - CID shy Samuel Murdoch -1914 I 1843-1917

Mil5UMaf1ller

1865middot1869 1867 1906 1870middot1916 Wntdeath Chimist

1873 - 1912 1e oftiagtr01 the TI14R1C

~tn these ApI~ 191 2

Agnes Murdoch

1850-1944

1818-1891

William Murdoch 1856-1906

John Murdoch lS57 -1907

uptain Iltolaquoxr

I~ tilt sea 101 In rlie ~ Ap111906 Apr~ 1907

Margaret Elisabeth Murdoch 1882 -1973

teacher headmislress

Samuel Jr - CID shy ~artha Murdoch Patience Scott

1880middot1950 Merchant

1891 middot1976

Samuel Scott Murdoch

Grizzel Samuel Alexander Charles Donaldson Murdoch Murdoch Murdoch Murdoch

1822 -1877 1824 -1888 1827 - 1868 1829 - 1860 Shoemaker uDtain uptltn

OJowrerlln ~nt Nwy

HI~ cxItnl~ ~

Figure 1 Non-phylogenetic tree- family tree Adapted from Murdoch Family Tree by Murdoch W 2013 Retrieved from httpwwwwilliammurdochnetJarticles_12_Murdoch_family_treehtml

Copyright 2013 by the Murdoch Adapted with permission

8

General Objective The main objective of this research study is to employ a machine

learning algorithm that can classify images into phylogenetic tree or non-phylogenetic trees

Specific Objective The specific objectives of this study are

i To employ machine learning that can predict phylogenetic tree that represent in the

Image

II To compare and contrast the different features that represent phylogenetic tree on

image

Research Question

I Can neural network be used for prediction of phylogenetic tree images

II What are the discriminative features can be used for classifier learning

i. A phylogenetic tree, also called a phylogeny, is a branching diagram that illustrates the lines of evolutionary relationship among different species, organisms, or genes from a common ancestor (Baum D, 2008).

ii. Phylogeny is the evolutionary relationship between organisms (Baum D, 2008).

iii. Evolutionary analysis is the fundamental basis of phylogenetic trees, subject to cautionary notes (Brinkman, 2005).

iv. Content mining is defined as a significant part of figure mining, which deals with non-textual content (Mounce, 2014).

This research study hopes to advance knowledge on the automated digitization of phylogenetic tree images by classifying images taken from PDF or text files as phylogenetic trees or non-phylogenetic trees.

This research study focuses mainly on the rooted tree (cladogram) and the unrooted tree.

In conclusion, phylogenetics is the science of constructing hypotheses about the evolutionary relationships of organisms in the form of phylogenetic trees. This project is not concerned with the reconstruction of phylogenetic trees. Rather, it uses a machine learning algorithm to classify images in PDF or text files as phylogenetic trees or non-phylogenetic trees.

CHAPTER TWO

LITERATURE REVIEW

As mentioned by Mounce (2012), millions of papers about phylogenetic trees are now published each year, at an ever-growing rate. This is because the amount and diversity of species with at least partial sequence information is rapidly increasing. Thus, phylogenetic trees have become an integral part of various biological studies, driven by the exponential increase in sequence data generated by classical and next-generation sequencing studies (Baum D, 2008). This chapter is divided into a few sections. The first section focuses on phylogenetic trees and explains their meaning, purpose, and types. The next section concentrates on image features and emphasizes the features that are suitable for the image classification process. This section also reviews image recognition system frameworks.

Phylogenetic Tree

A phylogenetic tree, or evolutionary tree, is an illustrative representation of biological entities that are associated by common descent, such as species or higher-level taxonomic groupings (Gregory, 2008). The phylogenetic tree represents a backbone for various other biological studies (Baum, 2008). It is therefore a graphical representation of a hypothesis about the evolution of a species, with branches that separated, hybridized, or terminated by extinction. Readers can read and understand the patterns of descent from phylogenetic trees. However, phylogenetic trees do not indicate when species evolved or how much genetic change occurred in a lineage (Kelly, Grenyer, & Scotland, 2014). Nor should it be assumed from a phylogenetic tree that a taxon evolved from the taxon next to it.

Baum, Smith, and Donovan (2005) stated that the phylogenetic tree is the most direct representation of the principle of common ancestry, which is central to evolutionary theory. They emphasized that a practical understanding of what a phylogenetic tree represents is essential for understanding the evolutionary relationships of species. Phylogenetic trees have thus become important in the evolutionary analysis of any species, and biologists should increase their use in the biological sciences. Phylogenetic trees also provide an efficient structure for organizing biodiversity information and help develop an accurate conception of the totality of evolutionary history. It is therefore important for aspiring biologists to develop an understanding of phylogenetic trees.

Types of Phylogenetic Tree

Phylogenetic trees can be divided into different kinds. There are two main categories: rooted phylogenetic trees and unrooted phylogenetic trees. Apart from these two main categories, a phylogenetic tree can be drawn in several forms: slanted cladogram (Figure 4; Phylogenetic tree, 2002), rectangular cladogram (Figure 5; Phylogenetic tree, 2002), and circular cladogram (Figure 6; Phylogenetic tree, 2002). Roots can be artificially added to unrooted trees by means of a species that unambiguously separated early from the other species being considered (Bacardit, 2009).
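The idea of rooting an unrooted tree with an early-diverging species (an outgroup) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not part of the classification system: the unrooted tree is stored as a hypothetical adjacency dictionary, the taxa A to D and internal nodes n1 and n2 are invented for the example, and the result is written in the standard Newick parenthesis notation.

```python
def root_with_outgroup(adjacency, outgroup):
    """Return a Newick string for the tree rooted on the branch to `outgroup`."""
    neighbours = adjacency[outgroup]
    if len(neighbours) != 1:
        raise ValueError("the outgroup must be a leaf with exactly one neighbour")

    def to_newick(node, parent):
        # Children are all neighbours except the node we arrived from.
        children = [n for n in adjacency[node] if n != parent]
        if not children:
            return node  # a leaf is written as its own name
        return "(" + ",".join(to_newick(child, node) for child in children) + ")"

    # The new root sits on the branch between the outgroup and the rest of the tree.
    return "(" + to_newick(neighbours[0], outgroup) + "," + outgroup + ");"


# Hypothetical unrooted tree: leaves A, B, C, D joined through n1 and n2.
UNROOTED = {
    "A": ["n1"], "B": ["n1"], "C": ["n2"], "D": ["n2"],
    "n1": ["A", "B", "n2"],
    "n2": ["n1", "C", "D"],
}

print(root_with_outgroup(UNROOTED, "D"))  # (((A,B),C),D);
print(root_with_outgroup(UNROOTED, "A"))  # ((B,(C,D)),A);
```

The same unrooted tree yields different rooted trees depending on which species is chosen as the outgroup, which is why the choice must rest on a species known to have diverged early.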




