linear-time computation of similarity measures for sequential data



Page 1: Linear-Time Computation of Similarity Measures for Sequential Data

Intelligent Database Systems Lab

國立雲林科技大學National Yunlin University of Science and Technology

Linear-Time Computation of Similarity Measures for Sequential Data

Presenter : Cheng-Feng Weng

Authors : Konrad Rieck and Pavel Laskov

2008/09/11

ML.26 (2008)

Page 2: Linear-Time Computation of Similarity Measures for Sequential Data

N.Y.U.S.T.

I. M.

Intelligent Database Systems Lab

Outline

Introduction
Motivation
Objective
Methods
Experimental results
Conclusion
Comments

Page 3: Linear-Time Computation of Similarity Measures for Sequential Data

Introduction

Sequential data is a fundamental data representation in computer science. Applications range from search engines and document ranking, through gene finding and prediction of protein functions, to network surveillance tools and anti-virus programs.

Providing an interface to sequential data is therefore an essential prerequisite for applications of machine learning in these domains.

Example (DNA sequence): …ATGCAACTAAT…

Page 4: Linear-Time Computation of Similarity Measures for Sequential Data

Motivation

Most learning algorithms impose only a loose constraint on the type of data they can handle. A powerful abstraction between algorithms and data representations must be established.

Numerous applications define relationships between objects through metric or non-metric distances used as similarity measures. It is imperative to address pairwise comparison of objects in the most general setup.

Page 5: Linear-Time Computation of Similarity Measures for Sequential Data

Objective

The aim of this contribution is to develop a general framework for pairwise comparison of sequences. The authors focus on algorithms with linear-time asymptotic complexity in the sequence lengths.

The framework also provides linear-time algorithms of differing complexity and capabilities, using sorted arrays, tries and suffix trees as underlying data structures.

Page 6: Linear-Time Computation of Similarity Measures for Sequential Data

Embedding Sequences using a Formal Language

The authors focus on three definitions of the embedding language L.

Bag-of-words: L = Dictionary (explicit) or $L = (A \setminus D)^{*}$ (implicit), where A is the alphabet and D is the set of delimiter symbols.

This is a book. → this is a book

ADDACCTACA → ADD ACCT AC A
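The implicit bag-of-words language can be illustrated with a short tokenizer. This is only a sketch: the delimiter set `DELIMITERS` (whitespace and ASCII punctuation) and the lowercasing step are assumptions read off the "This is a book." example, and `bag_of_words` is a hypothetical helper name.

```python
import string
from collections import Counter

# Assumed delimiter set D: whitespace and ASCII punctuation.
DELIMITERS = set(string.whitespace + string.punctuation)

def bag_of_words(x: str, delimiters=DELIMITERS) -> Counter:
    """Split x at delimiter symbols and count the resulting words."""
    words, current = [], []
    for ch in x.lower():  # lowercasing is an assumption taken from the slide example
        if ch in delimiters:
            if current:  # close the word collected so far
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:  # last word when x does not end with a delimiter
        words.append("".join(current))
    return Counter(words)

print(bag_of_words("This is a book."))
```

Each maximal run of non-delimiter symbols becomes one word, which is what membership in $(A \setminus D)^{*}$ amounts to in practice.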

Page 7: Linear-Time Computation of Similarity Measures for Sequential Data

Embedding Sequences using a Formal Language (cont.)

K-grams: $L = A^{k}$

Example: abbaac (k = 4) yields abba, bbaa, baac

Contiguous sequences: $L = A^{*}$

abbac abbbbbbbad
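The k-gram example can be reproduced with a few lines of Python. This is a minimal sketch; `kgrams` is a hypothetical helper name, not something taken from the slides.

```python
from collections import Counter

def kgrams(x: str, k: int) -> Counter:
    """Count every contiguous substring of length k in x."""
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))

# The slide example: abbaac with k = 4.
print(sorted(kgrams("abbaac", 4)))  # ['abba', 'baac', 'bbaa']
```

A sequence of length n contains n - k + 1 k-grams, so extraction is linear in n for fixed k.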

Page 8: Linear-Time Computation of Similarity Measures for Sequential Data

The embedding function

Given an embedding language L, a sequence x can be mapped into the |L|-dimensional feature space by calculating a function $\phi_w(x)$ for every $w \in L$ appearing in x:

$\phi_w(x) = \psi(\mathrm{occ}(w, x)) \cdot W_w$

where $\psi(\mathrm{occ}(w, x))$ is a frequency, probability or binary flag and $W_w$ is a weight.
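As an illustration, the embedding can be sketched as a sparse dictionary. The sketch assumes each feature is a transformed occurrence count multiplied by a per-word weight. The names `embed`, `psi` and `weights` are hypothetical, and k-grams are used as the embedding language for concreteness.

```python
from collections import Counter
from typing import Callable, Dict, Optional

def embed(x: str, k: int,
          psi: Callable[[int], float] = float,
          weights: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Sparse embedding of x over k-grams: only words that appear in x are stored."""
    occ = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    weights = weights or {}
    # Transformed count times weight; words without an explicit weight get 1.0.
    return {w: psi(c) * weights.get(w, 1.0) for w, c in occ.items()}

counts = embed("baaaab", 2)                     # raw occurrence counts
flags = embed("baaaab", 2, psi=lambda c: 1.0)   # binary flags
print(counts, flags)
```

Storing only the words that occur in x keeps the representation linear in the sequence length, even though the feature space has |L| dimensions.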

Page 9: Linear-Time Computation of Similarity Measures for Sequential Data

Weighting schemes

The following three weighting schemes for defining W have been proposed in previous research:

Corpus dependent weighting:

Length dependent weighting:

Position dependent weighting: decay factor 0 ≤ λ ≤ 1

Page 10: Linear-Time Computation of Similarity Measures for Sequential Data

A Generic Framework for Similarity Measures

All of the similarity measures share a similar mathematical construction: an inner component-wise function is aggregated over each dimension using an outer operator.

$s(x, y) = \bigoplus_{w \in L} m(\phi_w(x), \phi_w(y))$

with inner function $m$ and outer operator $\oplus$.
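The inner-function and outer-operator construction can be sketched generically. Everything named here (`similarity`, `kgram_counts`, the three example measures) is illustrative. The sketch uses hash-based dictionaries rather than the data structures from the slides, and it relies on the inner function returning the neutral element when both counts are zero.

```python
from collections import Counter
from functools import reduce

def kgram_counts(x: str, k: int) -> Counter:
    """Raw k-gram occurrence counts, used here as the embedding."""
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))

def similarity(phi_x, phi_y, inner, outer, neutral):
    """Aggregate inner(phi_x[w], phi_y[w]) over all words w with the outer operator."""
    words = set(phi_x) | set(phi_y)  # words absent from both contribute nothing
    values = (inner(phi_x.get(w, 0), phi_y.get(w, 0)) for w in words)
    return reduce(outer, values, neutral)

px, py = kgram_counts("abbaa", 3), kgram_counts("baaaab", 3)

# Three measures obtained by swapping the inner function and outer operator.
linear_kernel = similarity(px, py, lambda a, b: a * b, lambda s, v: s + v, 0)
manhattan = similarity(px, py, lambda a, b: abs(a - b), lambda s, v: s + v, 0)
chebyshev = similarity(px, py, lambda a, b: abs(a - b), max, 0)
print(linear_kernel, manhattan, chebyshev)  # 1 5 2
```

Only the two callables and the neutral element change between measures. The traversal over words is shared.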

Page 11: Linear-Time Computation of Similarity Measures for Sequential Data

A Generic Framework for Similarity Measures (cont.)

Unified formulation of similarity measures:

Page 12: Linear-Time Computation of Similarity Measures for Sequential Data

A Generic Framework for Similarity Measures (cont.)

Define m(0,0) = e, where e is the neutral element for the operator ⊕.

Conjunctive similarity measures:

Disjunctive similarity measures:

Page 13: Linear-Time Computation of Similarity Measures for Sequential Data

Algorithms and Data Structures

The authors present three approaches that differ in capabilities and implementation complexity: simple sorted arrays, tries and generalized suffix trees.

Sorted arrays are simple but limited in capabilities. Tries are more involved, yet they do not cover all embedding languages. Generalized suffix trees are relatively complex but support the full range of embedding languages.

Page 14: Linear-Time Computation of Similarity Measures for Sequential Data

Sorted Arrays

Sorted arrays of 3-grams for x = abbaa and y = baaaab.

Figure label: Disjunctive
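A comparison over sorted arrays can be sketched as a single merge pass. The function names and the `conjunctive` flag are hypothetical. Python's `sorted` is comparison-based, so only the merge loop is linear here; a linear-time sort would be needed to keep the whole procedure linear.

```python
from collections import Counter

def sorted_kgrams(x: str, k: int):
    """Sorted array of (k-gram, count) pairs for x."""
    counts = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    return sorted(counts.items())

def compare_sorted(ax, ay, inner, outer, neutral, conjunctive):
    """Merge two sorted arrays; conjunctive measures skip words found in only one."""
    i = j = 0
    result = neutral
    while i < len(ax) and j < len(ay):
        (wx, cx), (wy, cy) = ax[i], ay[j]
        if wx == wy:                      # word present in both sequences
            result = outer(result, inner(cx, cy))
            i += 1
            j += 1
        elif wx < wy:                     # word present only in x
            if not conjunctive:
                result = outer(result, inner(cx, 0))
            i += 1
        else:                             # word present only in y
            if not conjunctive:
                result = outer(result, inner(0, cy))
            j += 1
    if not conjunctive:                   # leftovers belong to one sequence only
        for _, cx in ax[i:]:
            result = outer(result, inner(cx, 0))
        for _, cy in ay[j:]:
            result = outer(result, inner(0, cy))
    return result

ax, ay = sorted_kgrams("abbaa", 3), sorted_kgrams("baaaab", 3)
add = lambda s, v: s + v
kernel = compare_sorted(ax, ay, lambda a, b: a * b, add, 0, conjunctive=True)
manhattan = compare_sorted(ax, ay, lambda a, b: abs(a - b), add, 0, conjunctive=False)
print(kernel, manhattan)  # 1 5
```

For the slide's example, the only shared 3-gram is baa, which gives a kernel value of 1. The disjunctive pass also visits the four unshared 3-grams.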

Page 15: Linear-Time Computation of Similarity Measures for Sequential Data

Tries

Tries of 3-grams for x = abbaa and y = baaaab.

Figure labels: word; Root = nil
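A trie-based comparison might look like the following sketch, which stores k-gram counts at the leaves and descends both tries in parallel. `TrieNode`, `build_trie` and `trie_kernel` are hypothetical names, and only a conjunctive measure (a linear kernel over counts) is shown.

```python
class TrieNode:
    """Trie node: child links keyed by symbol, plus a count for complete k-grams."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0

def build_trie(x: str, k: int) -> TrieNode:
    """Insert every k-gram of x; the node ending a k-gram holds its count."""
    root = TrieNode()
    for i in range(len(x) - k + 1):
        node = root
        for ch in x[i:i + k]:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1
    return root

def trie_kernel(tx: TrieNode, ty: TrieNode) -> int:
    """Sum of count products over shared k-grams, following shared branches only."""
    total = tx.count * ty.count           # zero for inner nodes, whose count is 0
    for ch, child_x in tx.children.items():
        child_y = ty.children.get(ch)
        if child_y is not None:
            total += trie_kernel(child_x, child_y)
    return total

tx, ty = build_trie("abbaa", 3), build_trie("baaaab", 3)
print(trie_kernel(tx, ty))  # 1
```

Branches that exist in only one trie are never entered, which is how a conjunctive measure avoids touching unshared words.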

Page 16: Linear-Time Computation of Similarity Measures for Sequential Data

Generalized Suffix Trees

Generalized suffix tree for x = abbaa$1 and y = baaaab$2.

Node annotation: occ(w, x), occ(w, y)
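The occ(w, x) and occ(w, y) annotations can be illustrated with a deliberately naive stand-in: an uncompressed suffix trie built by inserting every suffix, which takes quadratic time and space. It is not the linear-time generalized suffix tree; a construction such as Ukkonen's algorithm would be needed for that. All names are hypothetical, and the terminal symbols $1 and $2 are replaced by a per-sequence index on each counter.

```python
class Node:
    """Suffix-trie node; occ[0] and occ[1] count occurrences in x and y."""

    def __init__(self):
        self.children = {}
        self.occ = [0, 0]

def build_generalized_suffix_trie(x: str, y: str) -> Node:
    """Insert all suffixes of x and y (naive, quadratic time and space)."""
    root = Node()
    for idx, s in enumerate((x, y)):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node.children.setdefault(ch, Node())
                # One suffix passing through this node means one occurrence
                # of the word spelled by the path from the root.
                node.occ[idx] += 1
    return root

def occurrences(root: Node, w: str):
    """Return (occ(w, x), occ(w, y)) by following w from the root."""
    node = root
    for ch in w:
        node = node.children.get(ch)
        if node is None:
            return (0, 0)
    return tuple(node.occ)

root = build_generalized_suffix_trie("abbaa", "baaaab")
print(occurrences(root, "aa"))   # (1, 3)
print(occurrences(root, "baa"))  # (1, 1)
```

Because every contiguous subsequence is a prefix of some suffix, a single structure answers occurrence queries for words of any length, which is what lets suffix trees handle unrestricted embedding languages.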

Page 17: Linear-Time Computation of Similarity Measures for Sequential Data

Generalized Suffix Trees (cont.)

Construct the tree

Page 18: Linear-Time Computation of Similarity Measures for Sequential Data

Run-time Experiments

Embedding language: bag-of-words (textual data).

Page 19: Linear-Time Computation of Similarity Measures for Sequential Data

Run-time Experiments

Embedding language: k-grams (all data sets).

Page 20: Linear-Time Computation of Similarity Measures for Sequential Data

Applications

Unsupervised text categorization.

better

Page 21: Linear-Time Computation of Similarity Measures for Sequential Data

Applications

Network intrusion detection.

Page 22: Linear-Time Computation of Similarity Measures for Sequential Data

Applications

Transcription start site recognition.

Page 23: Linear-Time Computation of Similarity Measures for Sequential Data

Conclusions

The framework for comparison of sequences proposed in this article provides means for efficient computation of a large variety of similarity measures, including kernels, distances and non-metric similarity coefficients.

As realizations of the framework, it provides linear-time algorithms of differing complexity and capabilities, using sorted arrays, tries and suffix trees as underlying data structures. Sorted arrays are the most efficient but the most limited in applicability. Generalized suffix trees can handle unrestricted embedding languages, but at a higher cost.

Page 24: Linear-Time Computation of Similarity Measures for Sequential Data

Comments

Advantage: Practical for these domains.

Drawback: Unclear presentation; too many references.

Application: