machine learning techniques for bacteria...

29
Machine Learning Techniques for Bacteria Classification Massimo La Rosa Riccardo Rizzo Alfonso M. Urso S. Gaglio ICAR-CNR University of Palermo Workshop on Hardware Architectures Beyond 2020: Challenges and Opportunities for Computational Biology and Bioinformatics Napoli – December 19, 2007

Upload: others

Post on 06-May-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Machine Learning Techniques for Bacteria Classificationmariog/Workshop2007/presentazioni/Urso.pdf · 35x35 360 30x30 240 25x25 90 20x20 30 19x19 23 18x18 17 17x17 16 16x16 12 15x15

Machine Learning Techniques for Bacteria Classification

Massimo La RosaRiccardo RizzoAlfonso M. UrsoS. Gaglio

ICAR-CNR

University of Palermo

Workshop on Hardware Architectures Beyond 2020:

Challenges and Opportunities for Computational Biology and Bioinformatics

Napoli – December 19, 2007

Page 2: Machine Learning Techniques for Bacteria Classificationmariog/Workshop2007/presentazioni/Urso.pdf · 35x35 360 30x30 240 25x25 90 20x20 30 19x19 23 18x18 17 17x17 16 16x16 12 15x15

Motivation

Results

Conclusions

A new approach to microbial identification

MethodologiesSelf organizingTopographic map

Deterministic annealing

Goals Genotypic feature based taxonomyVisualization

Outline

A methodology to create a visualization and classification tool

Page 3: Machine Learning Techniques for Bacteria Classificationmariog/Workshop2007/presentazioni/Urso.pdf · 35x35 360 30x30 240 25x25 90 20x20 30 19x19 23 18x18 17 17x17 16 16x16 12 15x15

Motivation

Results

Conclusions

A new approach to microbial identification

MethodologiesSelf organizingTopographic map

Deterministic annealing

Goals Genotypic feature based taxonomyVisualization

Outline

A methodology to create a visualization and classification tool

Page 4: Machine Learning Techniques for Bacteria Classificationmariog/Workshop2007/presentazioni/Urso.pdf · 35x35 360 30x30 240 25x25 90 20x20 30 19x19 23 18x18 17 17x17 16 16x16 12 15x15

Motivation

● Microbial identification is crucial for the study of infectious diseases.

● Bacterial taxonomy is usually based on phenotypic characters

● A new approach based on bacteria genotype is under development

● 16S rRNA “housekeeping” gene for taxonomic purposes

Page 5: Machine Learning Techniques for Bacteria Classificationmariog/Workshop2007/presentazioni/Urso.pdf · 35x35 360 30x30 240 25x25 90 20x20 30 19x19 23 18x18 17 17x17 16 16x16 12 15x15

Motivation

Results

Conclusions

A new approach to microbial identification

MethodologiesSelf organizingTopographic map

Deterministic annealing

Goals Genotypic feature based taxonomyVisualization

Outline

A methodology to create a visualization and classification tool

Page 6: Machine Learning Techniques for Bacteria Classificationmariog/Workshop2007/presentazioni/Urso.pdf · 35x35 360 30x30 240 25x25 90 20x20 30 19x19 23 18x18 17 17x17 16 16x16 12 15x15

Goal

● Genotypic features based taxonomy ● Topographic representation of the

bacteria clusters– Finding misclassification = discovery of new

pathogens

– Classifying organisms with an unusual phenotype

Page 7: Machine Learning Techniques for Bacteria Classificationmariog/Workshop2007/presentazioni/Urso.pdf · 35x35 360 30x30 240 25x25 90 20x20 30 19x19 23 18x18 17 17x17 16 16x16 12 15x15

Motivation

Results

Conclusions

A new approach to microbial identification

MethodologiesSelf organizingTopographic map

Deterministic annealing

Goals Genotypic feature based taxonomyVisualization

Outline

A methodology to create a visualization and classification tool

Page 8: Machine Learning Techniques for Bacteria Classificationmariog/Workshop2007/presentazioni/Urso.pdf · 35x35 360 30x30 240 25x25 90 20x20 30 19x19 23 18x18 17 17x17 16 16x16 12 15x15

Methodologies

● General framework

● Building Dataset

● Sequence Alignment

● Evolutionary Distance

● Soft Topographic Map Algorithm

Page 9: Machine Learning Techniques for Bacteria Classificationmariog/Workshop2007/presentazioni/Urso.pdf · 35x35 360 30x30 240 25x25 90 20x20 30 19x19 23 18x18 17 17x17 16 16x16 12 15x15

1 downloading and filtering gene sequences from NCBI databases

2 sequence alignment (Needleman-Wunsch)

3 computing dissimilarity matrix (evolutionary distance)

4 clustering (SOM on pairwise distances) and visualization (UMatrix style map)

Sequence

DB

Results

Filtering

Taxonomy

Retrieval

Sequence

Alignment

Labeling

Computing

Distance

Clustering

and

mapping

Sequence

File 1

2 3 4

General framework

Page 10: Machine Learning Techniques for Bacteria Classificationmariog/Workshop2007/presentazioni/Urso.pdf · 35x35 360 30x30 240 25x25 90 20x20 30 19x19 23 18x18 17 17x17 16 16x16 12 15x15

Building Dataset

Phylum BXII (Proteobacteria)Class III (Gammaproteobacteria)

14 Orders

147 16S rRNA gene sequencesdownloaded from GenBank database

Page 11: Machine Learning Techniques for Bacteria Classificationmariog/Workshop2007/presentazioni/Urso.pdf · 35x35 360 30x30 240 25x25 90 20x20 30 19x19 23 18x18 17 17x17 16 16x16 12 15x15

Sequence Alignment

● Sequence alignment allows to compare homologous sites of the same gene between two different species

● Two well known alignment algorithms used:

– ClustalW: multiple-alignment

– Needleman-Wunsch: pairwise alignment

Page 12: Machine Learning Techniques for Bacteria Classificationmariog/Workshop2007/presentazioni/Urso.pdf · 35x35 360 30x30 240 25x25 90 20x20 30 19x19 23 18x18 17 17x17 16 16x16 12 15x15

Evolutionary Distance

● The simplest type of distance is the number of nucleotide substitutions per site.

– Warning: it underrates real distances● Jukes and Cantor method was used: it

provides a better estimate of evolutionary distances

● Evolutionary distances are elements of the symmetric dissimilarity matrix:

Type strain 1 2 3 4 5 6 71 0 0.06286 0.11215 0.06482 0.05128 0.09451 0.067852 0 0.10608 0.0579 0.065 0.07196 0.046823 0 0.1224 0.11418 0.10279 0.115384 0 0.06082 0.10224 0.067645 0 0.10595 0.073626 0 0.082327 0...

Page 13: Machine Learning Techniques for Bacteria Classificationmariog/Workshop2007/presentazioni/Urso.pdf · 35x35 360 30x30 240 25x25 90 20x20 30 19x19 23 18x18 17 17x17 16 16x16 12 15x15

Soft Topographic Map Algorithm (1)

● Extension of Kohonen's SOM for pairwise data● The position of bacteria clusters in the topographic

maps is based on the optimization, through deterministic annealing technique, of a cost function that takes its minimum when each data point is mapped to the best matching neuron

Soft Topographic Map Algorithm

input

Dissimilarity matrix

Topographic map showing relationships among Bacteria clusters

Type strain 1 2 31 0 0,06286 0,112152 0 0,106083 0

output

Page 14: Machine Learning Techniques for Bacteria Classificationmariog/Workshop2007/presentazioni/Urso.pdf · 35x35 360 30x30 240 25x25 90 20x20 30 19x19 23 18x18 17 17x17 16 16x16 12 15x15

Soft Topographic Map Algorithm (2)

Partial assignment cost :

e t r=∑s

hrs∑t '

at ' s dtt '−12∑t ' '

at ' ' sd t ' t ' ' ,∀ t , r

Weighting factor :

a t r=∑s

hrs P x t∈C s

∑t '∑s

hrsP xt '∈C s,∀ t , r

Assignment probability :

P x t∈C r=exp −et r

∑u

exp−e t u,∀ t , r

Neighborhood function :

hr s=exp −∣r−s∣2

22 ,∀ r , s1) Initialization Step:

a) put e t rn t r ,∀ t , r ,∈[0,1]

b) compute lookup table for hrsc) choose initial value of , final

, increasing temperature factor , threshold

2) Training Step:a) while final (Annealing cycle)

i. repeat (EM cycle)A) E step: compute

P x t∈C r∀ t , rB) M step: compute

a t rnew ,∀ t , r

C) M step: computee t rnew ,∀ t , r

ii. until ∥e t rnew−e t r

old∥

iii. put

b) end while

1) Initialization Step:a) put e t rn t r ,∀ t , r ,∈[0,1]

b) compute lookup table for hrsc) choose initial value of ,

final , increasing temperaturefactor , threshold

2) Training Step:a) while final (Annealing cycle)

i. repeat (EM cycle)A) E step: compute

P x t∈C r∀ t , rB) M step: compute

a t rnew ,∀ t , r

C) M step: computee t rnew ,∀ t , r

ii. until ∥e t rnew−e t r

old∥

iii. put

b) end while

Page 15: Machine Learning Techniques for Bacteria Classificationmariog/Workshop2007/presentazioni/Urso.pdf · 35x35 360 30x30 240 25x25 90 20x20 30 19x19 23 18x18 17 17x17 16 16x16 12 15x15

Soft Topographic mapInput vector

The topographic map is a lattice (two dimensional in our case) that self organize in the pattern space

Page 16: Machine Learning Techniques for Bacteria Classificationmariog/Workshop2007/presentazioni/Urso.pdf · 35x35 360 30x30 240 25x25 90 20x20 30 19x19 23 18x18 17 17x17 16 16x16 12 15x15

Soft Topographic Map Algorithm (3)

● The algorithm that “moves” this lattice is called deterministic annealing– The advantage of deterministic annealing is

to find a global minimum of the approximation error

Partial assignment cost :

e t r=∑s

hrs∑t '

at ' s d tt '−12∑t ' '

a t ' ' sd t ' t ' ' ,∀ t , r

Weighting factor :

a t r=∑s

hrs P x t∈C s

∑t '∑s

hrs P x t '∈C s,∀ t , r

Assignment probability :

P x t∈C r=exp −e t r

∑u

exp−et u,∀ t , r

Neighborhood function :

hr s=exp−∣r−s∣2

2 2 ,∀ r , s1) Initialization Step:

a) put e t rn t r ,∀ t , r ,∈[0,1]

b) compute lookup table for hrsc) choose initial value of ,

final , increasing temperaturefactor , threshold

2) Training Step:a) while final (Annealing cycle)

i. repeat (EM cycle)A) E step: compute

P x t∈C r∀ t , r

B) M step: computea t rnew ,∀ t , r

C) M step: computee t rnew ,∀ t , r

ii. until ∥e t rnew−e t r

old∥

iii. put

b) end while

Deterministicannealing

Page 17: Machine Learning Techniques for Bacteria Classificationmariog/Workshop2007/presentazioni/Urso.pdf · 35x35 360 30x30 240 25x25 90 20x20 30 19x19 23 18x18 17 17x17 16 16x16 12 15x15

Motivation

Results

Conclusions

A new approach to microbial identification

MethodologiesSelf organizingTopographic map

Deterministic annealing

Goals Genotypic feature based taxonomyVisualization

Outline

A methodology to create a visualization and classification tool

Page 18: Machine Learning Techniques for Bacteria Classificationmariog/Workshop2007/presentazioni/Urso.pdf · 35x35 360 30x30 240 25x25 90 20x20 30 19x19 23 18x18 17 17x17 16 16x16 12 15x15

Experimental Results

● From 8x8 up to 45x45 map dimensions● We trained 20 maps of each geometry in order to

avoid the dependence from the initial conditions● The results obtained using the two alignments

methods do not present any significant difference

12x12 map 16x16 map 20x20 map

Page 19: Machine Learning Techniques for Bacteria Classificationmariog/Workshop2007/presentazioni/Urso.pdf · 35x35 360 30x30 240 25x25 90 20x20 30 19x19 23 18x18 17 17x17 16 16x16 12 15x15

138045x45

66040x40

36035x35

24030x30

9025x25

3020x20

2319x19

1718x18

1617x17

1216x16

915x15

614x14

413x13

412x12

211x11

210x10

19x9

0,68x8

Average processing time (min.)Map size

Experimental tests

● Hardware resources

● 16 nodes cluster, dual processor Xeon 3.4 GHz, 4 GB RAM, 6 TB storage, Myrinet-Fiber communication

● Software

● Languages: Java, Python

● Libraries: BioJava, Jama ....

Page 20: Machine Learning Techniques for Bacteria Classificationmariog/Workshop2007/presentazioni/Urso.pdf · 35x35 360 30x30 240 25x25 90 20x20 30 19x19 23 18x18 17 17x17 16 16x16 12 15x15

Mixed Clusters

Map Evaluation

Page 21: Machine Learning Techniques for Bacteria Classificationmariog/Workshop2007/presentazioni/Urso.pdf · 35x35 360 30x30 240 25x25 90 20x20 30 19x19 23 18x18 17 17x17 16 16x16 12 15x15

We only have distances between patterns, and no metrics!

Usually topology measures are considered, but in our case there is not a space that contains the patterns (sequences)

Probable topology distortion

Map Evaluation

Ideal Map

Page 22: Machine Learning Techniques for Bacteria Classificationmariog/Workshop2007/presentazioni/Urso.pdf · 35x35 360 30x30 240 25x25 90 20x20 30 19x19 23 18x18 17 17x17 16 16x16 12 15x15

Map evaluation

● We take rows and columns of the maps and compare the order of the elements in map with the order obtained from the dissimilarity matrix

Page 23: Machine Learning Techniques for Bacteria Classificationmariog/Workshop2007/presentazioni/Urso.pdf · 35x35 360 30x30 240 25x25 90 20x20 30 19x19 23 18x18 17 17x17 16 16x16 12 15x15

Map evaluation

This sequence...

...is compared with...

Dissimilarity matrix

Type strain 1 2 31 0 0.06286 0.112152 0 0.106083 0

...the sequence of the same objects

obtained from the dissimilarity matrix

This comparison is made using the Spearman coefficient in order to obtain a similarity value among the two sequences.

Of course the two sequences should be the same in a good map

Page 24: Machine Learning Techniques for Bacteria Classificationmariog/Workshop2007/presentazioni/Urso.pdf · 35x35 360 30x30 240 25x25 90 20x20 30 19x19 23 18x18 17 17x17 16 16x16 12 15x15

Map Evaluation

Page 25: Machine Learning Techniques for Bacteria Classificationmariog/Workshop2007/presentazioni/Urso.pdf · 35x35 360 30x30 240 25x25 90 20x20 30 19x19 23 18x18 17 17x17 16 16x16 12 15x15

Map evaluation

Low number of mixed clustersLow value of Spearman Coefficient

The “Best” Map

● We have an index for each map and we can see that some geometry are better than other

Page 26: Machine Learning Techniques for Bacteria Classificationmariog/Workshop2007/presentazioni/Urso.pdf · 35x35 360 30x30 240 25x25 90 20x20 30 19x19 23 18x18 17 17x17 16 16x16 12 15x15

The “Best” Map

Page 27: Machine Learning Techniques for Bacteria Classificationmariog/Workshop2007/presentazioni/Urso.pdf · 35x35 360 30x30 240 25x25 90 20x20 30 19x19 23 18x18 17 17x17 16 16x16 12 15x15

Comparison with the phylogenetic tree

Page 28: Machine Learning Techniques for Bacteria Classificationmariog/Workshop2007/presentazioni/Urso.pdf · 35x35 360 30x30 240 25x25 90 20x20 30 19x19 23 18x18 17 17x17 16 16x16 12 15x15

Motivation

Results

Conclusions

A new approach to microbial identification

MethodologiesSelf organizingTopographic map

Deterministic annealing

Goals Genotypic feature based taxonomyVisualization

Outline

A methodology to create a visualization and classification tool

Page 29: Machine Learning Techniques for Bacteria Classificationmariog/Workshop2007/presentazioni/Urso.pdf · 35x35 360 30x30 240 25x25 90 20x20 30 19x19 23 18x18 17 17x17 16 16x16 12 15x15

Conclusions

● Soft Topographic Map for clustering and classification of bacteria

● Genotype based taxonomy● Detecting singular situations● Further analysis with other

housekeeping genes or using other distance algorithms, e.g. Normalized Compressed Distance