processing compressed texts: a tractability borderyura/talks/compressed-cpm-talk.pdf · software...

44
Processing Compressed Texts: A Tractability Border Yury Lifshits Steklov Institute of Mathematics at St.Petersburg http://logic.pdmi.ras.ru/~yura Combinatorial Pattern Matching 2007 Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 1 / 23

Upload: others

Post on 16-Oct-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Processing Compressed Texts:A Tractability Border

Yury LifshitsSteklov Institute of Mathematics at St.Petersburg

http://logic.pdmi.ras.ru/~yura

Combinatorial Pattern Matching 2007

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 1 / 23

Page 2: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Consider two compressed texts

=?

Can we say whether they are equalwithout unpacking them?

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 2 / 23

Page 3: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Consider two compressed texts

=?

Can we say whether they are equalwithout unpacking them?

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 2 / 23

Page 4: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Outline

1 Popular topic in stringology: algorithms forcompressed texts

2 New algorithm for fully compressed pattern

matching

3 Tractability border for processing compressed

texts

4 Open problems

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 3 / 23

Page 5: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Outline

1 Popular topic in stringology: algorithms forcompressed texts

2 New algorithm for fully compressed pattern

matching

3 Tractability border for processing compressed

texts

4 Open problems

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 3 / 23

Page 6: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Outline

1 Popular topic in stringology: algorithms forcompressed texts

2 New algorithm for fully compressed pattern

matching

3 Tractability border for processing compressed

texts

4 Open problems

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 3 / 23

Page 7: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Outline

1 Popular topic in stringology: algorithms forcompressed texts

2 New algorithm for fully compressed pattern

matching

3 Tractability border for processing compressed

texts

4 Open problems

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 3 / 23

Page 8: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Part I

Formulating the problem

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 4 / 23

Page 9: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Straight-line Programs: Definition

Straight-line program(SLP) is a context-free grammar

generating exactly one string

Two types of productions:Xi → a and Xi → XpXq

Example

abaababaabaab

X1 → bX2 → aX3 → X2X1

X4 → X3X2

X5 → X4X3

X6 → X5X4

X7 → X6X5

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 5 / 23

Page 10: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 6 / 23

Page 11: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Rytter Theorem

Theorem (Rytter’03)

Resulting archive of most practical compression methodscan be transformed into SLP generating the sameoriginal text. Two nice properties:

1 SLP has almost the same size as initial archive

2 Transformation is fast

Notable exception: compression schemes based onBurrows-Wheeler transform

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 7 / 23

Page 12: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Rytter Theorem

Theorem (Rytter’03)

Resulting archive of most practical compression methodscan be transformed into SLP generating the sameoriginal text. Two nice properties:

1 SLP has almost the same size as initial archive

2 Transformation is fast

Notable exception: compression schemes based onBurrows-Wheeler transform

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 7 / 23

Page 13: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Why algorithms on compressed texts?

Problems with implicitly given objects:word equations [P04], program equivalence [LR06],message sequence chart verification [GM02]

Acceleration by text compression:HMM training and decoding [MWZU07]

Connections to practical problems:software verification, database compression,multimedia search

Complexity surprises:many problems have unexpected computationalcomplexity when input is given in compressed form

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 8 / 23

Page 14: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Why algorithms on compressed texts?

Problems with implicitly given objects:word equations [P04], program equivalence [LR06],message sequence chart verification [GM02]

Acceleration by text compression:HMM training and decoding [MWZU07]

Connections to practical problems:software verification, database compression,multimedia search

Complexity surprises:many problems have unexpected computationalcomplexity when input is given in compressed form

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 8 / 23

Page 15: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Why algorithms on compressed texts?

Problems with implicitly given objects:word equations [P04], program equivalence [LR06],message sequence chart verification [GM02]

Acceleration by text compression:HMM training and decoding [MWZU07]

Connections to practical problems:software verification, database compression,multimedia search

Complexity surprises:many problems have unexpected computationalcomplexity when input is given in compressed form

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 8 / 23

Page 16: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Why algorithms on compressed texts?

Problems with implicitly given objects:word equations [P04], program equivalence [LR06],message sequence chart verification [GM02]

Acceleration by text compression:HMM training and decoding [MWZU07]

Connections to practical problems:software verification, database compression,multimedia search

Complexity surprises:many problems have unexpected computationalcomplexity when input is given in compressed form

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 8 / 23

Page 17: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Comparison Problems

Consider two straight-line programs P ,Qgenerating texts P and Q. Tasks:

1 Determine whether P = Q

2 Fully compressed pattern matching:check whether P is a substring of Q

3 Compute Hamming distance between P and Q

Let m, n be the sizes of P ,Q. Best previously knownresult is O(n2m2) algorithm for (1) and (2) MST’97

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 9 / 23

Page 18: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Comparison Problems

Consider two straight-line programs P ,Qgenerating texts P and Q. Tasks:

1 Determine whether P = Q

2 Fully compressed pattern matching:check whether P is a substring of Q

3 Compute Hamming distance between P and Q

Let m, n be the sizes of P ,Q. Best previously knownresult is O(n2m2) algorithm for (1) and (2) MST’97

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 9 / 23

Page 19: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Example

Do they generate the same text?

X1 → bX2 → aX3 → X2X1

X4 → X3X2

X5 → X4X3

X6 → X5X4

X1

Y1 → bY2 → aY3 → Y1Y2

Y4 → Y2Y3

Y5 → Y3Y4

Y6 → Y4Y5

X1

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 10 / 23

Page 20: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Example

Do they generate the same text?

X1 → bX2 → aX3 → X2X1

X4 → X3X2

X5 → X4X3

X6 → X5X4

X7 → X6X5

Y1 → bY2 → aY3 → Y1Y2

Y4 → Y2Y3

Y5 → Y3Y4

Y6 → Y4Y5

Y7 → Y5Y6

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 10 / 23

Page 21: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Part II

Fully compressed pattern matching

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 11 / 23

Page 22: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

New FCPM Algorithm

MAIN RESULT 1Fully compressed pattern matching canbe solved in O(n2m) time

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 12 / 23

Page 23: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Dynamic Programming (1/2)

X1 → bX2 → aX3 → X2X1

X4 → X3X2

X5 → X4X3

X6 → X5X4

X1

Y1 → bY2 → aY3 → Y1Y2

Y4 → Y2Y3

Y5 → Y3Y4

Y6 → Y4Y5

Y7 → Y5Y6

For every i , j we compute occurrences of Xi in Yj

in lexicographic order of pairs: from (1, 1) to (6, 7)

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 13 / 23

Page 24: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Dynamic Programming (2/2)

SearchingX3 X2

X4

inY6

From previous steps of dynamic programming:X3 occurrences in Y6

X2 occurrences in Y6.

Take X3 occurrences, X2 occurrences, shift and intersect!

Complexity of one step of dynamic programming: O(n)Number of steps in dynamic programming: O(nm)

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 14 / 23

Page 25: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Dynamic Programming (2/2)

SearchingX3 X2

X4

inY6

From previous steps of dynamic programming:X3 occurrences in Y6

X2 occurrences in Y6.

Take X3 occurrences, X2 occurrences, shift and intersect!

Complexity of one step of dynamic programming: O(n)Number of steps in dynamic programming: O(nm)

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 14 / 23

Page 26: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Dynamic Programming (2/2)

SearchingX3 X2

X4

inY6

From previous steps of dynamic programming:X3 occurrences in Y6

X2 occurrences in Y6.

Take X3 occurrences, X2 occurrences, shift and intersect!

Complexity of one step of dynamic programming: O(n)Number of steps in dynamic programming: O(nm)

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 14 / 23

Page 27: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Part III

Tractability Border

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 15 / 23

Page 28: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Check Your Intuition

When two SLPs of sizes n, m are given we can checkequivalence between texts generated by them in timeO(min(n2m, m2n))

What is the complexity of computing Hamming distancebetween two SLP-generated texts?

MAIN RESULT 2The problem of computing Hamming distance betweenSLP-generated texts is #P-complete

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 16 / 23

Page 29: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Check Your Intuition

When two SLPs of sizes n, m are given we can checkequivalence between texts generated by them in timeO(min(n2m, m2n))

What is the complexity of computing Hamming distancebetween two SLP-generated texts?

MAIN RESULT 2The problem of computing Hamming distance betweenSLP-generated texts is #P-complete

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 16 / 23

Page 30: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Check Your Intuition

When two SLPs of sizes n, m are given we can checkequivalence between texts generated by them in timeO(min(n2m, m2n))

What is the complexity of computing Hamming distancebetween two SLP-generated texts?

MAIN RESULT 2The problem of computing Hamming distance betweenSLP-generated texts is #P-complete

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 16 / 23

Page 31: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Tractability Border

When input texts are compressed:

∃ poly algorithm:AGKPR’96 EquivalenceGKPR’96 Regular LanguageMembershipKPR’95 Fully CompressedPattern MatchingCGLM’06 Window SubsequenceMatchingGKPR’96 Shortest PeriodL’07 Shortest Cover

At least NP-hard:AL’07 Hamming distanceLohrey’04 Context-FreeLanguage MembershipLL’06 Fully CompressedSubsequence MatchingBKLPR’02 Two-dimensionalCompressed Pattern MatchingAA

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 17 / 23

Page 32: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Part IV

Open Problems

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 18 / 23

Page 33: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

OP1: Longest Common Substring

Input: Two SLPs generating texts P and Q

Task: Compute the length of the longest commonsubstring of P and Q

Can we do it in polynomial time?

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 19 / 23

Page 34: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

OP1: Longest Common Substring

Input: Two SLPs generating texts P and Q

Task: Compute the length of the longest commonsubstring of P and Q

Can we do it in polynomial time?

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 19 / 23

Page 35: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

OP2: SLP for OR

Input: Two SLPs of sizes n, m generating binary stringsP and Q of the same length

Task: Compute a close-to-minimal SLPgenerating “bitwise OR between P and Q”

Can we do it in time poly(n + m + output)?

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 20 / 23

Page 36: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

OP2: SLP for OR

Input: Two SLPs of sizes n, m generating binary stringsP and Q of the same length

Task: Compute a close-to-minimal SLPgenerating “bitwise OR between P and Q”

Can we do it in time poly(n + m + output)?

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 20 / 23

Page 37: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

OP3: Edit Distance

Input: Text P of length l and SLP of size m generatingtext Q

Input: Compute edit distance between P and Q

Can we do it in O(lm) time?

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 21 / 23

Page 38: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

OP3: Edit Distance

Input: Text P of length l and SLP of size m generatingtext Q

Input: Compute edit distance between P and Q

Can we do it in O(lm) time?

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 21 / 23

Page 39: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Summary

Straight-line programs is a wide-accepted model ofcompressed text

Fully compressed pattern matching can be solved inO(n2m) time

Computing Hamming distance between compressedtexts is #P-complete

Open problems: longest common substring, bitwiseOR and edit distance for compressed texts

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 22 / 23

Page 40: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Summary

Straight-line programs is a wide-accepted model ofcompressed text

Fully compressed pattern matching can be solved inO(n2m) time

Computing Hamming distance between compressedtexts is #P-complete

Open problems: longest common substring, bitwiseOR and edit distance for compressed texts

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 22 / 23

Page 41: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Summary

Straight-line programs is a wide-accepted model ofcompressed text

Fully compressed pattern matching can be solved inO(n2m) time

Computing Hamming distance between compressedtexts is #P-complete

Open problems: longest common substring, bitwiseOR and edit distance for compressed texts

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 22 / 23

Page 42: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Summary

Straight-line programs is a wide-accepted model ofcompressed text

Fully compressed pattern matching can be solved inO(n2m) time

Computing Hamming distance between compressedtexts is #P-complete

Open problems: longest common substring, bitwiseOR and edit distance for compressed texts

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 22 / 23

Page 43: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Last SlideSearch “Lifshits” or visit http://logic.pdmi.ras.ru/~yura/

Yury Lifshits

Processing Compressed Texts: A Tractability Border

CPM’07 // on-line version

Yury Lifshits and Markus Lohrey

Querying and Embedding Compressed Texts

MFCS’06 // on-line version

Patrick Cegielski, Irene Guessarian, Yury Lifshits and Yuri Matiyasevich

Window Subsequence Problems for Compressed Texts

CSR’06 // on-line version

Thank you for your attention! Questions?

Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 23 / 23

Page 44: Processing Compressed Texts: A Tractability Borderyura/talks/compressed-cpm-talk.pdf · software verification, database compression, multimedia search Complexity surprises: many

Last SlideSearch “Lifshits” or visit http://logic.pdmi.ras.ru/~yura/

Yury Lifshits

Processing Compressed Texts: A Tractability Border

CPM’07 // on-line version

Yury Lifshits and Markus Lohrey

Querying and Embedding Compressed Texts

MFCS’06 // on-line version

Patrick Cegielski, Irene Guessarian, Yury Lifshits and Yuri Matiyasevich

Window Subsequence Problems for Compressed Texts

CSR’06 // on-line version

Thank you for your attention! Questions?Yury Lifshits (Steklov Inst. of Math) Processing Compressed Texts CPM’07 23 / 23