Intelligent Database Systems Lab
National Yunlin University of Science and Technology (國立雲林科技大學)
Linear-Time Computation of Similarity Measures for Sequential Data
Presenter : Cheng-Feng Weng
Authors : Konrad Rieck and Pavel Laskov
2008/09/11
ML.26 (2008)
N.Y.U.S.T.
I. M.
Outline
Introduction
Motivation
Objective
Methods
Experimental results
Conclusion
Comments
Introduction
Sequential data is a fundamental data representation in computer science. Applications range from search engines to document ranking, from gene finding to prediction of protein functions, and from network surveillance tools to anti-virus programs.
Providing an interface to sequential data is therefore an essential prerequisite for applications of machine learning in these domains.
Example (DNA sequence): …ATGCAACTAAT…
Motivation
Most learning algorithms impose a much looser constraint on the type of data that can be handled. A powerful abstraction between algorithms and data representations must be established.
Numerous applications exist in which relationships between objects are defined by metric or non-metric distances used as similarity measures. It is therefore imperative to address pairwise comparison of objects in the most general setup.
Objective
The aim of this contribution is to develop a general framework for pairwise comparison of sequences. The authors focus on algorithms with linear-time asymptotic complexity in the sequence lengths.
The paper also provides linear-time algorithms of different complexity and capabilities, using sorted arrays, tries and suffix trees as underlying data structures.
Embedding Sequences using a Formal Language
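The authors focus on three definitions of the embedding language L.
Bag-of-Words: L = Dictionary (explicit), or $L = (A \setminus D)^*$ (implicit), where A is the alphabet and D the set of delimiter symbols.
Example: "This is a book." → this, is, a, book
Example: ADDACCTACA → ADD, ACCT, AC, A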
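The implicit bag-of-words definition above can be illustrated with a minimal sketch: the sequence is scanned once and split wherever a delimiter occurs. The function name `bag_of_words`, the default delimiter set and the lowercasing step are assumptions made for this illustration, not details taken from the slides.

```python
from collections import Counter

def bag_of_words(text, delimiters=" .,;:!?"):
    """Split text at delimiter characters (assumed set D) and count the words."""
    words, current = [], []
    for ch in text:
        if ch in delimiters:
            # A delimiter closes the current word, if there is one.
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch.lower())  # lowercasing is an assumption
    if current:
        words.append("".join(current))
    return Counter(words)

print(sorted(bag_of_words("This is a book.")))  # ['a', 'book', 'is', 'this']
```

The single pass keeps the cost linear in the length of the text, which matches the spirit of the framework.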
Embedding Sequences using a Formal Language (cont.)
K-grams: $L = A^k$
Example: abbaac (k=4) → abba, bbaa, baac
Contiguous sequences:
Examples: abbac, abbbbbbbad
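As a small illustration of the k-gram language, the sketch below slides a window of length k over the sequence. The helper name `kgrams` is hypothetical.

```python
def kgrams(sequence, k):
    """Return all contiguous substrings of length k, in order of appearance."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

print(kgrams("abbaac", 4))  # ['abba', 'bbaa', 'baac']
```

A sequence of length n yields n - k + 1 k-grams, so the number of words to embed grows linearly with n.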
The embedding function
Given an embedding language L, a sequence x can be mapped into the |L|-dimensional feature space by calculating a function φ_w(x) for every w ∈ L appearing in x.
φ_w(x) combines two terms: an occurrence-based value (frequency, probability or binary flag) and a weight.
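A minimal sketch of such an embedding for k-grams is given below. It stores only the words that actually appear in x, so the |L|-dimensional vector stays sparse. The function name `embed`, the option names, and the choice to multiply the occurrence-based value by the weight are assumptions for illustration; the slide does not spell out the exact formula.

```python
from collections import Counter

def embed(sequence, k, transform="frequency", weights=None):
    """Map a sequence to a sparse feature dict {word: value * weight}.

    transform (assumed option names):
      "frequency"   -> raw number of occurrences
      "probability" -> occurrences divided by the total number of k-grams
      "binary"      -> 1.0 if the word occurs at all
    weights: optional dict of per-word weights; missing words default to 1.0.
    """
    occ = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(occ.values())
    features = {}
    for word, count in occ.items():
        if transform == "frequency":
            value = float(count)
        elif transform == "probability":
            value = count / total
        elif transform == "binary":
            value = 1.0
        else:
            raise ValueError("unknown transform: " + transform)
        weight = 1.0 if weights is None else weights.get(word, 1.0)
        features[word] = value * weight  # assumed combination: product
    return features

print(embed("baaaab", 3, "probability"))
```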
Weighting schemes
The following three weighting schemes for defining W have been proposed in previous research:
Corpus-dependent weighting
Length-dependent weighting
Position-dependent weighting
Decay factor: 0 ≤ λ ≤ 1
A Generic Framework for Similarity Measures
All of the similarity measures share a similar mathematical construction: an inner component-wise function is aggregated over each dimension using an outer operator.
Inner function: m. Outer operator: ⊕.
A Generic Framework for Similarity Measures (cont.)
Unified formulation of similarity measures:
$s(x, y) = \bigoplus_{w \in L} m(\phi_w(x), \phi_w(y))$
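The construction described here, an inner function m applied per word and aggregated with an outer operator ⊕, can be sketched as a single generic routine. The function name `similarity` and the sparse-dict representation of the embedded sequences are assumptions for illustration. Words absent from both sequences are skipped, which is valid when m(0,0) equals the neutral element of ⊕.

```python
import operator
from functools import reduce

def similarity(fx, fy, inner, outer, neutral):
    """Aggregate inner(fx[w], fy[w]) over all words w with the outer operator."""
    words = set(fx) | set(fy)  # words in neither sequence contribute only `neutral`
    values = (inner(fx.get(w, 0.0), fy.get(w, 0.0)) for w in words)
    return reduce(outer, values, neutral)

# 3-gram counts for x = abbaa and y = baaaab
fx = {"abb": 1.0, "bba": 1.0, "baa": 1.0}
fy = {"baa": 1.0, "aaa": 2.0, "aab": 1.0}

# Linear kernel: inner = product, outer = sum
print(similarity(fx, fy, operator.mul, operator.add, 0.0))             # 1.0
# Manhattan distance: inner = absolute difference, outer = sum
print(similarity(fx, fy, lambda a, b: abs(a - b), operator.add, 0.0))  # 5.0
# Chebyshev distance: inner = absolute difference, outer = max
print(similarity(fx, fy, lambda a, b: abs(a - b), max, 0.0))           # 2.0
```

Swapping only `inner` and `outer` yields kernels, distances and other coefficients from the same loop, which is the point of the unified formulation.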
A Generic Framework for Similarity Measures (cont.)
Define m(0,0) = e, where e is the neutral element for the operator ⊕.
Conjunctive similarity measures:
Disjunctive similarity measures:
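The practical difference between the two classes can be sketched as follows, under the assumption that a measure is conjunctive when a word missing from either sequence contributes only the neutral element e (so only words shared by both sequences matter), and disjunctive when words present in just one sequence also contribute. The function names and the sparse-dict inputs are hypothetical.

```python
import operator

def conjunctive(fx, fy, inner, outer, neutral):
    """Visit only words present in both sequences (assumes m(a,0) = m(0,b) = e)."""
    result = neutral
    for word, a in fx.items():
        if word in fy:
            result = outer(result, inner(a, fy[word]))
    return result

def disjunctive(fx, fy, inner, outer, neutral):
    """Visit every word present in at least one of the two sequences."""
    result = neutral
    for word in fx.keys() | fy.keys():
        result = outer(result, inner(fx.get(word, 0.0), fy.get(word, 0.0)))
    return result

fx = {"abb": 1.0, "bba": 1.0, "baa": 1.0}   # 3-grams of abbaa
fy = {"baa": 1.0, "aaa": 2.0, "aab": 1.0}   # 3-grams of baaaab

# Linear kernel (product, sum): a zero factor gives 0, the neutral element of +.
print(conjunctive(fx, fy, operator.mul, operator.add, 0.0))             # 1.0
# Manhattan distance (|a - b|, sum): unmatched words still add |a| or |b|.
print(disjunctive(fx, fy, lambda a, b: abs(a - b), operator.add, 0.0))  # 5.0
```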
Algorithms and Data Structures
The authors present three approaches that differ in capabilities and implementation complexity: simple sorted arrays, tries and generalized suffix trees.
Sorted arrays are simple but limited in capabilities. Tries are more involved, yet they do not cover all embedding languages. Generalized suffix trees are relatively complex but support the full range of embedding languages.
Sorted Arrays
Figure: sorted arrays of 3-grams for x = abbaa and y = baaaab.
Figure label: Disjunctive
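A minimal sketch of the sorted-array approach for the example x = abbaa and y = baaaab: each sequence becomes a sorted array of (3-gram, count) pairs, and the two arrays are compared with one merge-style pass. The function names are hypothetical, and Python's comparison sort is used here for brevity, so the sorting step in this sketch is O(n log n) rather than linear; only the merge loop is linear in the array lengths.

```python
import operator

def sorted_kgram_counts(sequence, k):
    """Return a sorted list of (k-gram, count) pairs."""
    grams = sorted(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    counts = []
    for g in grams:
        if counts and counts[-1][0] == g:
            counts[-1] = (g, counts[-1][1] + 1)  # same word as previous entry
        else:
            counts.append((g, 1))
    return counts

def compare_sorted(ax, ay, inner, outer, neutral, conjunctive):
    """Merge two sorted (word, count) arrays in one linear pass."""
    i = j = 0
    result = neutral
    while i < len(ax) and j < len(ay):
        (wx, cx), (wy, cy) = ax[i], ay[j]
        if wx == wy:                      # word occurs in both sequences
            result = outer(result, inner(cx, cy))
            i, j = i + 1, j + 1
        elif wx < wy:                     # word occurs only in x
            if not conjunctive:
                result = outer(result, inner(cx, 0))
            i += 1
        else:                             # word occurs only in y
            if not conjunctive:
                result = outer(result, inner(0, cy))
            j += 1
    if not conjunctive:                   # leftovers matter only when disjunctive
        for _, cx in ax[i:]:
            result = outer(result, inner(cx, 0))
        for _, cy in ay[j:]:
            result = outer(result, inner(0, cy))
    return result

ax = sorted_kgram_counts("abbaa", 3)   # [('abb', 1), ('baa', 1), ('bba', 1)]
ay = sorted_kgram_counts("baaaab", 3)  # [('aaa', 2), ('aab', 1), ('baa', 1)]
print(compare_sorted(ax, ay, operator.mul, operator.add, 0, True))              # 1
print(compare_sorted(ax, ay, lambda a, b: abs(a - b), operator.add, 0, False))  # 5
```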
Tries
Figure: tries of 3-grams for x = abbaa and y = baaaab.
Figure labels: word; Root = nil
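The trie-based comparison can be sketched as a parallel traversal of the two tries built for the example sequences. This is an illustrative sketch only: the nested-dict node layout, the `EMPTY` placeholder for a branch missing in one trie, and the function names are assumptions, not the authors' implementation.

```python
import operator

def build_trie(sequence, k):
    """Insert every k-gram; the node at depth k stores its occurrence count."""
    root = {"count": 0, "children": {}}
    for i in range(len(sequence) - k + 1):
        node = root
        for ch in sequence[i:i + k]:
            node = node["children"].setdefault(ch, {"count": 0, "children": {}})
        node["count"] += 1
    return root

EMPTY = {"count": 0, "children": {}}  # stands in for a missing branch

def compare_tries(nx, ny, inner, outer, neutral, conjunctive):
    """Traverse both tries in parallel and aggregate inner() over the leaves."""
    kx, ky = nx["children"], ny["children"]
    if not kx and not ky:                 # reached the end of a word
        return inner(nx["count"], ny["count"])
    # Conjunctive measures follow shared branches only; disjunctive ones all.
    labels = kx.keys() & ky.keys() if conjunctive else kx.keys() | ky.keys()
    result = neutral
    for ch in labels:
        sub = compare_tries(kx.get(ch, EMPTY), ky.get(ch, EMPTY),
                            inner, outer, neutral, conjunctive)
        result = outer(result, sub)
    return result

tx = build_trie("abbaa", 3)
ty = build_trie("baaaab", 3)
print(compare_tries(tx, ty, operator.mul, operator.add, 0, True))              # 1
print(compare_tries(tx, ty, lambda a, b: abs(a - b), operator.add, 0, False))  # 5
```

Because all words have the same length k, every leaf sits at depth k, and the traversal touches each trie node at most once.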
Generalized Suffix Trees
Figure: generalized suffix tree for x = abbaa$1 and y = baaaab$2.
Figure labels: occ(w,x), occ(w,y)
Generalized Suffix Trees (cont.)
Construct the tree
Run-time Experiments
Embedding language: bag-of-words (textual data).
Run-time Experiments
Embedding language: k-grams (all data sets).
Applications
Unsupervised text categorization.
Applications
Network intrusion detection.
Applications
Transcription start site recognition.
Conclusions
The framework for comparison of sequences proposed in this article provides means for efficient computation of a large variety of similarity measures, including kernels, distances and non-metric similarity coefficients.
As realizations of the framework, the article provides linear-time algorithms of different complexity and capabilities, using sorted arrays, tries and suffix trees as underlying data structures. Sorted arrays are the most efficient but the most limited in applicability. Generalized suffix trees can handle unrestricted embedding languages, but at a higher cost.
Comments
Advantage: practical for these domains
Drawback: not clearly written; too many references
Application: …