linear-time computation of similarity measures for sequential data

Intelligent Database Systems Lab

National Yunlin University of Science and Technology (國立雲林科技大學)

Linear-Time Computation of Similarity Measures for Sequential Data

Presenter : Cheng-Feng Weng

Authors : Konrad Rieck and Pavel Laskov

2008/09/11

ML.26 (2008)

N.Y.U.S.T.

I. M.



Outline

Introduction
Motivation
Objective
Methods
Experimental results
Conclusion
Comments


Introduction

Sequential data is a fundamental data representation in computer science. Applications range from search engines to document ranking, from gene finding to prediction of protein functions, and from network surveillance tools to anti-virus programs.

Providing an interface to sequential data is therefore an essential prerequisite for applications of machine learning in these domains.

Example (DNA sequence): …ATGCAACTAAT…


Motivation

Most learning algorithms impose a much looser constraint on the type of data that can be handled, so a powerful abstraction between algorithms and data representations must be established.

Numerous applications exist in which relationships between objects are defined by metric or non-metric distances used as similarity measures. It is therefore imperative to address pairwise comparison of objects in the most general setup.


Objective

The aim of this contribution is to develop a general framework for pairwise comparison of sequences. The authors focus on algorithms with linear-time asymptotic complexity in the sequence lengths.

The framework also provides linear-time algorithms of different complexity and capabilities, using sorted arrays, tries and suffix trees as underlying data structures.


Embedding Sequences using a Formal Language

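The authors focus on three definitions of the embedding language.

Bag-of-Words: L = Dictionary (explicit), or $L = (A \setminus D)^*$ (implicit), where A is the alphabet and D is the set of delimiter symbols.

Example: "This is a book." → this, is, a, book

Example: ADDACCTACA → ADD, ACCT, AC, A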


Embedding Sequences using a Formal Language (cont.)

K-grams: $L = A^k$

Example: abbaac (k = 4) → abba, bbaa, baac

Contiguous sequences: $L = A^*$

Examples: abbac, abbbbbbbad
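The k-gram language can be illustrated with a minimal Python sketch that slides a window of length k over a sequence. The helper name `kgrams` is hypothetical and is not taken from the paper.

```python
def kgrams(x, k):
    """Return all contiguous substrings of length k, in order of occurrence."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length n has n - k + 1 k-grams (none if n < k).
    return [x[i:i + k] for i in range(len(x) - k + 1)]


print(kgrams("abbaac", 4))  # ['abba', 'bbaa', 'baac']
```

The output matches the slide example for abbaac with k = 4.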


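The embedding function

Given an embedding language L, a sequence x can be mapped into the |L|-dimensional feature space by calculating a function $\phi_w(x)$ for every $w \in L$ appearing in x:

$\phi_w(x) = \psi(\mathrm{occ}(w, x)) \cdot W_w$

Here $\psi(\mathrm{occ}(w, x))$ is a frequency, probability or binary flag derived from the number of occurrences of w in x, and $W_w$ is a weight.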
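A sparse version of this mapping might look as follows, assuming the embedding has the form φ_w(x) = ψ(occ(w, x)) · W_w and using k-grams as the embedding language. The names `embed`, `psi` and `weight` are illustrative assumptions, and the defaults (raw counts, unit weights) are chosen only for the sketch.

```python
from collections import Counter


def embed(x, k, psi=lambda c: c, weight=lambda w: 1.0):
    """Sparse embedding: map each k-gram w of x to psi(occ(w, x)) * W_w."""
    # occ[w] is the number of occurrences of the k-gram w in x.
    occ = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    # Only words that actually appear in x are stored; all others are implicitly 0.
    return {w: psi(c) * weight(w) for w, c in occ.items()}


counts = embed("baaaab", 3)                                   # raw counts
flags = embed("baaaab", 3, psi=lambda c: 1 if c > 0 else 0)   # binary flags
print(counts["aaa"], flags["aaa"])  # 2.0 1.0
```

Storing only the words that occur in x keeps the representation linear in the sequence length, even though the feature space has |L| dimensions.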


Weighting schemes

The following three weighting schemes for defining W have been proposed in previous research:

Corpus dependent weighting
Length dependent weighting
Position dependent weighting, with decay factor 0 ≤ λ ≤ 1


A Generic Framework for Similarity Measures

All of the similarity measures share a similar mathematical construction: an inner component-wise function is aggregated over each dimension using an outer operator.

Figure labels: inner function, outer operator


A Generic Framework for Similarity Measures (cont.)

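Unified formulation of similarity measures, with inner function m and outer operator ⊕:

$s(x, y) = \bigoplus_{w \in L} m(\phi_w(x), \phi_w(y))$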
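The construction from an inner function and an outer operator can be sketched in Python as below. This is an illustration only: `similarity` and `embed` are hypothetical names, 3-grams with raw counts are assumed as the embedding, and words absent from both sequences are skipped on the assumption that m(0, 0) is neutral for the outer operator.

```python
from collections import Counter
from functools import reduce
import operator


def embed(x, k):
    """Count the k-grams of x (raw occurrence counts, unit weights)."""
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))


def similarity(phi_x, phi_y, inner, outer, neutral):
    """Aggregate inner(phi_w(x), phi_w(y)) over all words using `outer`."""
    words = phi_x.keys() | phi_y.keys()
    values = (inner(phi_x.get(w, 0), phi_y.get(w, 0)) for w in words)
    return reduce(outer, values, neutral)


x, y = embed("abbaa", 3), embed("baaaab", 3)
# Manhattan distance: inner = |a - b|, outer = +
manhattan = similarity(x, y, lambda a, b: abs(a - b), operator.add, 0)
# Linear kernel: inner = a * b, outer = +
kernel = similarity(x, y, operator.mul, operator.add, 0)
# Chebyshev distance: inner = |a - b|, outer = max
chebyshev = similarity(x, y, lambda a, b: abs(a - b), max, 0)
print(manhattan, kernel, chebyshev)  # 5 1 2
```

Swapping only `inner` and `outer` yields different measures from the same traversal, which is the point of the unified formulation.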


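A Generic Framework for Similarity Measures (cont.)

Define m(0,0) = e, where e is the neutral element for the operator ⊕. Words that occur in neither sequence then do not affect the result. Let $L_x$ denote the set of words of L that occur in x.

Conjunctive similarity measures: m also yields e when only one argument is zero, so only words occurring in both sequences contribute:

$s(x, y) = \bigoplus_{w \in L_x \cap L_y} m(\phi_w(x), \phi_w(y))$

Disjunctive similarity measures: words occurring in at least one of the sequences contribute:

$s(x, y) = \bigoplus_{w \in L_x \cup L_y} m(\phi_w(x), \phi_w(y))$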
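The difference between the two classes can be made concrete with a small Python sketch. It assumes addition as the outer operator (so e = 0) and 3-gram counts as the embedding; the function names are hypothetical.

```python
from collections import Counter


def embed(x, k):
    """Count the k-grams of x."""
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))


def conjunctive(phi_x, phi_y, inner):
    """Sum inner(...) over words present in BOTH sequences (e.g. a kernel)."""
    shared = phi_x.keys() & phi_y.keys()
    return sum(inner(phi_x[w], phi_y[w]) for w in shared)


def disjunctive(phi_x, phi_y, inner):
    """Sum inner(...) over words present in AT LEAST ONE sequence (e.g. a distance)."""
    either = phi_x.keys() | phi_y.keys()
    return sum(inner(phi_x.get(w, 0), phi_y.get(w, 0)) for w in either)


x, y = embed("abbaa", 3), embed("baaaab", 3)
product = lambda a, b: a * b
abs_diff = lambda a, b: abs(a - b)

print(conjunctive(x, y, product))   # 1, only 'baa' is shared
print(disjunctive(x, y, abs_diff))  # 5
```

For the product, restricting the sum to shared words loses nothing, because every term with a zero argument is already 0. For the absolute difference it would drop non-zero terms, which is why distances need the disjunctive form.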


Algorithms and Data Structures

The authors present three approaches that differ in capabilities and implementation complexity: simple sorted arrays, tries and generalized suffix trees.

Sorted arrays are simple but limited in capabilities. Tries are more involved, yet they do not cover all embedding languages. Generalized suffix trees are relatively complex but support the full range of embedding languages.


Sorted Arrays

Figure: sorted arrays of 3-grams for x = abbaa and y = baaaab (disjunctive comparison).
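A merge-style comparison of two sorted arrays might be sketched as follows. This is not the authors' implementation: the names are hypothetical, addition is assumed as the outer operator, and Python's `sorted` is a comparison sort, so the construction step here is O(n log n) rather than linear.

```python
from collections import Counter


def sorted_array(x, k):
    """Sorted list of (k-gram, count) pairs for sequence x."""
    counts = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    return sorted(counts.items())


def compare(ax, ay, inner):
    """One merge-style pass over two sorted arrays, summing inner(count_x, count_y)."""
    i = j = 0
    total = 0
    while i < len(ax) and j < len(ay):
        (wx, cx), (wy, cy) = ax[i], ay[j]
        if wx == wy:        # word occurs in both sequences
            total += inner(cx, cy)
            i += 1
            j += 1
        elif wx < wy:       # word occurs only in x
            total += inner(cx, 0)
            i += 1
        else:               # word occurs only in y
            total += inner(0, cy)
            j += 1
    # Any leftover words occur in only one of the sequences.
    total += sum(inner(c, 0) for _, c in ax[i:])
    total += sum(inner(0, c) for _, c in ay[j:])
    return total


ax, ay = sorted_array("abbaa", 3), sorted_array("baaaab", 3)
print(compare(ax, ay, lambda a, b: abs(a - b)))  # 5
```

Each element of either array is visited once, so the comparison itself is linear in the number of stored words. A conjunctive measure could skip the two "only in" branches and the leftover sums.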


Tries

Figure: tries of 3-grams for x = abbaa and y = baaaab.

Figure labels: word; Root = nil
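The following Python sketch builds a trie of k-grams for each sequence and compares the two tries with a parallel traversal. It is an illustration under assumptions: the class and function names are hypothetical, addition is the outer operator, and nodes with a zero count in both tries are skipped as neutral.

```python
class Node:
    def __init__(self):
        self.children = {}  # symbol -> child Node
        self.count = 0      # occurrences of the word ending at this node


def build_trie(x, k):
    """Insert every k-gram of x into a trie and count its occurrences."""
    root = Node()
    for i in range(len(x) - k + 1):
        node = root
        for symbol in x[i:i + k]:
            if symbol not in node.children:
                node.children[symbol] = Node()
            node = node.children[symbol]
        node.count += 1
    return root


def compare(nx, ny, inner):
    """Parallel depth-first traversal; either node may be None."""
    cx = nx.count if nx is not None else 0
    cy = ny.count if ny is not None else 0
    total = inner(cx, cy) if (cx or cy) else 0
    kids_x = nx.children if nx is not None else {}
    kids_y = ny.children if ny is not None else {}
    for symbol in kids_x.keys() | kids_y.keys():
        total += compare(kids_x.get(symbol), kids_y.get(symbol), inner)
    return total


tx, ty = build_trie("abbaa", 3), build_trie("baaaab", 3)
print(compare(tx, ty, lambda a, b: abs(a - b)))  # 5
```

Building the trie costs O(k) per position, so this variant is linear in the sequence length only when k is treated as a constant.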


Generalized Suffix Trees

Figure: generalized suffix tree for x = abbaa$1 and y = baaaab$2.

Figure annotation: (occ(w, x), occ(w, y))
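To show which quantities such a tree provides, the sketch below computes the pairs (occ(w, x), occ(w, y)) for every contiguous substring w by brute force. This is deliberately not a suffix tree: it enumerates a quadratic number of substrings and serves only as a reference for small inputs. The function names are hypothetical.

```python
from collections import Counter


def all_substrings(x):
    """Counter of occ(w, x) for every non-empty contiguous substring w of x."""
    return Counter(x[i:j] for i in range(len(x)) for j in range(i + 1, len(x) + 1))


def occurrence_pairs(x, y):
    """Map each substring w of x or y to the pair (occ(w, x), occ(w, y))."""
    ox, oy = all_substrings(x), all_substrings(y)
    # A Counter returns 0 for words it has never seen.
    return {w: (ox[w], oy[w]) for w in ox.keys() | oy.keys()}


pairs = occurrence_pairs("abbaa", "baaaab")
print(pairs["baa"])  # (1, 1)
print(pairs["aa"])   # (1, 3)
```

A generalized suffix tree represents the same occurrence information compactly, which is what makes unrestricted embedding languages tractable.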


Generalized Suffix Trees (cont.)

Construct the tree


Run-time Experiments

Embedding language: bag-of-words (textual data).


Run-time Experiments

Embedding language: k-grams (all data sets).


Applications

Unsupervised text categorization.



Applications

Network intrusion detection.


Applications

Transcription start site recognition.


Conclusions

The framework for comparison of sequences proposed in this article provides means for efficient computation of a large variety of similarity measures, including kernels, distances and non-metric similarity coefficients.

As realizations of the framework, it provides linear-time algorithms of different complexity and capabilities, using sorted arrays, tries and suffix trees as underlying data structures. Sorted arrays are the most efficient but the most limited in applicability. Generalized suffix trees can handle unrestricted embedding languages, but at a higher cost.


Comments

Advantage: Practical for these domains.

Drawback: Unclear in places, with too many references.

Application: …
