coding for dna storage in live organisms - unicamp · (source:wikipedia)...
Coding for DNA Storage in Live Organisms
Moshe Schwartz
Electrical & Computer Engineering, Ben-Gurion University
Israel
Based on joint works with: (alphabetically)
• Jehoshua Bruck – Caltech
• Ohad Elishco – Ben-Gurion University (now MIT)
• Farzad Farnoud (Hassanzadeh) – University of Virginia
• Siddharth Jain – Caltech
• Yonatan Yehezkeally – Ben-Gurion University
Introduction 2 / 79
Science fiction distant future dream?
No – It’s just around the corner!
DNA is a long string
Genetic information is stored in DNA, which is a string of nucleotides: Adenine, Cytosine, Guanine, and Thymine.
In E. coli bacteria, genetic information is stored in about $4 \cdot 10^6$ base pairs. In humans, genetic information is stored in over $3 \cdot 10^9$ base pairs.
Why store information in DNA?
DNA is dense!
It stores information at the molecular level.
DNA can potentially hold $250 \cdot 2^{50}$ bytes (250 petabytes) of information in 1 gram of DNA.
If we were to store the same amount on 8 TB hard drives, we would need 32,000 hard drives, with a total weight of about 25 tons!
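The drive count above can be checked with a few lines of arithmetic. This is only a sanity check, assuming binary prefixes (1 TB = 2^40 bytes) and metric tons; the per-drive weight is merely what the slide's "about 25 tons" implies, not a measured figure.

```python
# Back-of-the-envelope check of the slide's numbers (binary prefixes assumed).
dna_bytes = 250 * 2**50            # 250 petabytes claimed for 1 gram of DNA
drive_bytes = 8 * 2**40            # one 8 TB hard drive
drives = dna_bytes // drive_bytes  # number of drives needed

# Weight per drive implied by "about 25 tons" (metric tons assumed).
kg_per_drive = 25_000 / drives

print(drives, round(kg_per_drive, 2))
```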
OK, but why in living organisms?
• Reading from DNA is destructive, hence we need several copies. Living organisms replicate and solve this problem.
• Data longevity is (potentially) better, due to replication of organisms.
• The organism’s outer shell provides extra protection.
• Labeling organisms for biological studies.
• Watermarking genetically modified organisms (GMOs).
Main disadvantage:
Mutations!
We need error-correcting codes.
Error-correcting codes – An age old story
An error-correcting code has two main components:
1. An error ball: Its size and shape depend on the kind of errors the channel induces.
2. A packing of error balls: Its density affects communication efficiency. Its structure affects ease of encoding/decoding.
What kinds of errors do we expect?
Substitution: u v w ⇒ u v′ w
Deletion: u v w ⇒ u w
Insertion: u w ⇒ u v w
Duplication: u v w ⇒ u v v w
Which is the most common? Unknown yet, but…
Repeated sequences are everywhere
More than 50% of the human genome is repeated sequences!¹
Repetitions were shown to be connected with diseases such as cancer, myotonic dystrophy, and Huntington’s disease, and with important phenomena such as chromosome fragility, expansion diseases, silencing genes, and rapid morphological variation.
Repetitions are common in other species as well, and are claimed to be a major evolutionary force during vertebrate evolution.¹
¹ Lander et al., Nature 2001.
Duplication processes may repeat
ACTCA
⇓
ACTACTCA
⇓
ACTATACTCA
⇓
ACTATACACTCA
It is conceivable that a substantial portion of the unique genome, the part that is not known to contain repeated sequences, also has its origins in ancient repeated sequences that are no longer recognizable due to change over time.²
² Lander et al., Nature 2001.
Duplication processes may differ
End duplication: u v w ⇒ u v w v
Tandem duplication: u v w ⇒ u v v w
Palindromic duplication: u v w ⇒ u v vᴿ w
Interspersed duplication: u v w z ⇒ u v w v z
A formal definition
Definition
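Let $\Sigma$ be a finite alphabet, $s \in \Sigma^*$ some string, and $\mathcal{T} \subseteq (\Sigma^*)^{\Sigma^*}$ a set of string-duplication rules, i.e., functions from $\Sigma^*$ to $\Sigma^*$. A string-duplication system, $S$, defined by the tuple $(\Sigma, s, \mathcal{T})$, is the reflexive transitive closure of $\mathcal{T}$ operating on $s$, namely, $S \subseteq \Sigma^*$ is the minimal set for which:
1. $s \in S$.
2. $s' \in S$ and $T \in \mathcal{T}$ imply $T(s') \in S$.
We write $S = S(\Sigma, s, \mathcal{T})$.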
End duplication - formally
Definition (End Duplication)
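$$T^{\mathrm{end}}_{i,k}(x) = \begin{cases} uvwv & \text{if } x = uvw,\ |u| = i,\ |v| = k, \\ x & \text{otherwise.} \end{cases}$$
$$\mathcal{T}^{\mathrm{end}}_k = \left\{ T^{\mathrm{end}}_{i,k} \;\middle|\; i \geq 0 \right\}.$$
The end-duplication system is defined as $S^{\mathrm{end}}_k = S(\Sigma, s, \mathcal{T}^{\mathrm{end}}_k)$.
u v w ⇒ u v w v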
Tandem duplication - formally
Definition (Tandem Duplication)
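$$T^{\mathrm{tan}}_{i,k}(x) = \begin{cases} uvvw & \text{if } x = uvw,\ |u| = i,\ |v| = k, \\ x & \text{otherwise.} \end{cases}$$
$$\mathcal{T}^{\mathrm{tan}}_k = \left\{ T^{\mathrm{tan}}_{i,k} \;\middle|\; i \geq 0 \right\}.$$
The tandem-duplication system is defined as $S^{\mathrm{tan}}_k = S(\Sigma, s, \mathcal{T}^{\mathrm{tan}}_k)$.
u v w ⇒ u v v w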
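The two rules are easy to try out on concrete strings. Below is a minimal sketch of both, with 0-based offsets; the function names `end_dup` and `tandem_dup` and the hand-picked `(i, k)` pairs are illustrative choices, not notation from the talk. The loop replays the chain ACTCA ⇒ ... ⇒ ACTATACACTCA shown earlier.

```python
def end_dup(x: str, i: int, k: int) -> str:
    """T^end_{i,k}: append a copy of the k-substring starting at offset i."""
    if i < 0 or k < 0 or i + k > len(x):
        return x  # the "otherwise" branch: the rule does not apply
    return x + x[i:i + k]


def tandem_dup(x: str, i: int, k: int) -> str:
    """T^tan_{i,k}: insert a copy of x[i:i+k] immediately after itself."""
    if i < 0 or k < 0 or i + k > len(x):
        return x  # the "otherwise" branch: the rule does not apply
    return x[:i + k] + x[i:i + k] + x[i + k:]


# Replay the earlier chain with tandem duplications of lengths 3, 2, 2.
s = "ACTCA"
for i, k in [(0, 3), (2, 2), (5, 2)]:  # offsets chosen by hand
    s = tandem_dup(s, i, k)
    print(s)
```

Both functions return the input unchanged when the window does not fit, which mirrors the "otherwise" case of the definitions.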
How expressive is a duplication system?
Definition
The capacity of a string system S ⊆ Σ∗ is defined by
$$\mathrm{cap}(S) = \limsup_{n \to \infty} \frac{\log_2 |S \cap \Sigma^n|}{n}.$$
Definition
Let $S \subseteq \Sigma^*$ be a string system. We shall say $S$ is fully expressive if for every $v \in \Sigma^*$ there exist $u, w \in \Sigma^*$ such that $uvw \in S$.
We are interested in:
• How does the capacity depend on the choice of duplication rules?
• How does the capacity depend on the choice of seed string?
• Which systems are fully expressive?
• What is the connection between capacity and full expressiveness?
Some related previous work exists
Tandem duplication was studied in the context of formal languages:
• Martín-Vide and Paun, Acta Cybernetica (1999): Where are tandem-duplication languages located in the Chomsky hierarchy?
• Dassow, Mitrana and Paun, Bull. of the EATCS (1999): Binary tandem-duplication languages are regular.
• Ming-Wei, Bull. of the EATCS (2000): Non-binary tandem-duplication languages are irregular.
More related previous work exists
Tandem duplication was studied in an algorithmic context:
• Main and Lorentz, J. Alg. (1984), Gusfield and Stoye, J. Comp. and Systems Sci. (2004): How to efficiently find tandem duplications in a string.
• Matroud, Hendy, and Tuffley, Nucleic Acids Research (2011): How to efficiently find nested tandem duplications.
• Elemento et al., Molecular Bio. and Evolution (2002), Lajoie et al., J. Comp. Biology (2007), Brejová et al., Phil. Trans. R. Soc. A (2014): How to reconstruct the derivation process of a tandem-duplicated string.
End duplication has full capacity
Theorem
For $S^{\mathrm{end}}_k = S(\Sigma, s, \mathcal{T}^{\mathrm{end}}_k)$, $|s| \geq k$,
$$\mathrm{cap}(S^{\mathrm{end}}_k) = \log_2 |\Sigma|.$$
Assumption: The initial string $s$ contains every symbol of $\Sigma$ at least once.
End Duplication 20 / 79
End duplication has full capacity (Cont.)
Proof.
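≤: We obviously have,
$$\mathrm{cap}(S^{\mathrm{end}}_k) = \limsup_{n \to \infty} \frac{\log_2 \left| S^{\mathrm{end}}_k \cap \Sigma^n \right|}{n} \leq \limsup_{n \to \infty} \frac{\log_2 |\Sigma^n|}{n} = \log_2 |\Sigma|.$$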
End duplication has full capacity (Cont.)
Proof.
≥: We claim that starting with any string $s \in \Sigma^{\geq k}$, with each symbol appearing at least once, and any $w = w_1 w_2 \dots w_k \in \Sigma^k$, we can derive a string $y$ with $w$ as a suffix.
Step I: Duplicate prefix. Assume $s = uv$, $|u| = k$, then
$$s = uv \Rightarrow uvu = s'.$$
Observation: Every symbol of $\Sigma$ appears in the beginning and end of a $k$-substring of $s'$.
Step II: Force $w_1$ at the end.
[Figure: end duplication of a $k$-substring, after which the string ends in $w_1$.]
End duplication has full capacity (Cont.)
Proof.
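Step III: Force $w_1 w_2$ at the end.
[Figure: starting from a string that contains $w_2$ and ends in $w_1$, end duplication of a $k$-substring produces the substring $w_1 w_2$,]
and then
[Figure: a second end duplication of a $k$-substring leaves $w_1 w_2$ at the end of the string.]
Repeat Step III inductively to get $w_1 w_2 \dots w_k$ as a suffix.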
End duplication has full capacity (Cont.)
Proof.
Step IV: Repeat previous steps to get every $k$-word from $\Sigma^k$ as a substring.
Thus, after at most $2k|\Sigma|^k$ duplications we get a string $s''$ containing all possible $k$-substrings, $|s''| \leq 2k^2 |\Sigma|^k$.
For any $n = |s''| + tk$ we can now create $|\Sigma|^{tk}$ distinct strings. Hence,
$$\mathrm{cap}(S^{\mathrm{end}}_k) = \limsup_{n \to \infty} \frac{\log_2 \left| S^{\mathrm{end}}_k \cap \Sigma^n \right|}{n} \geq \limsup_{t \to \infty} \frac{\log_2 (|\Sigma|^{tk})}{|s''| + tk} \geq \log_2 |\Sigma|.$$
Corollary
$S^{\mathrm{end}}_k$ systems are fully expressive.
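For small parameters the growth of an end-duplication system can be observed directly. The sketch below enumerates the closure of a seed under end duplication up to a length cap; the seed `"01"`, `k = 1`, and the helper name `end_dup_language` are assumptions made for illustration. With this seed every binary string beginning with 01 is reachable, so the count at length n is 2^(n-2) and the printed rate slowly approaches log2 |Σ| = 1.

```python
from math import log2


def end_dup_language(seed: str, k: int, max_len: int) -> dict[int, set[str]]:
    """Closure of `seed` under end duplication of length k, truncated at max_len."""
    by_len = {len(seed): {seed}}
    frontier = {seed}
    while frontier:
        nxt = set()
        for x in frontier:
            if len(x) + k > max_len:
                continue  # descendants would exceed the length cap
            for i in range(len(x) - k + 1):
                y = x + x[i:i + k]  # T^end_{i,k}(x)
                if y not in by_len.setdefault(len(y), set()):
                    by_len[len(y)].add(y)
                    nxt.add(y)
        frontier = nxt
    return by_len


lang = end_dup_language("01", k=1, max_len=12)
for n in sorted(lang):
    # length, number of strings, empirical rate log2(count) / n
    print(n, len(lang[n]), round(log2(len(lang[n])) / n, 3))
```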
Tandem duplication behaves differently
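But first… Main tool – $\phi_k$-transform domain. We assume WLOG that $\Sigma = \mathbb{Z}_q$.
Definition
We define the transform $\phi_k : \mathbb{Z}_q^{\geq k} \to \mathbb{Z}_q^k \times \mathbb{Z}_q^*$ by,
$$\phi_k(x) = \left( \mathrm{Pref}_k(x),\ \mathrm{Suff}_{|x|-k}(x) - \mathrm{Pref}_{|x|-k}(x) \right),$$
as well as $\zeta_{i,k} : \mathbb{Z}_q^k \times \mathbb{Z}_q^* \to \mathbb{Z}_q^k \times \mathbb{Z}_q^*$,
$$\zeta_{i,k}(x, y) = \begin{cases} (x, u 0^k w) & \text{if } y = uw,\ |u| = i, \\ (x, y) & \text{otherwise,} \end{cases}$$
where $\mathrm{Pref}_i(x)$ and $\mathrm{Suff}_i(x)$ are, respectively, the $i$-prefix and $i$-suffix of $x$.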
Tandem Duplication 25 / 79
Main tool - φk-transform domain
Lemma
The following diagram commutes:
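$$\begin{array}{ccc}
\mathbb{Z}_q^{\geq k} & \xrightarrow{\;T^{\mathrm{tan}}_{i,k}\;} & \mathbb{Z}_q^{\geq k} \\
\downarrow{\scriptstyle \phi_k} & & \downarrow{\scriptstyle \phi_k} \\
\mathbb{Z}_q^k \times \mathbb{Z}_q^* & \xrightarrow{\;\zeta_{i,k}\;} & \mathbb{Z}_q^k \times \mathbb{Z}_q^*
\end{array}$$
i.e., for every string $x \in \mathbb{Z}_q^{\geq k}$,
$$\phi_k(T^{\mathrm{tan}}_{i,k}(x)) = \zeta_{i,k}(\phi_k(x)).$$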
Main tool - φk-transform domain
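Example: Assume $\Sigma = \mathbb{Z}_4$. Starting with 02123 and letting $i = 1$ and $k = 2$ leads to
$$\begin{array}{ccc}
02123 & \xrightarrow{\;T^{\mathrm{tan}}_{1,2}\;} & 021\underline{21}23 \\
\downarrow{\scriptstyle \phi_2} & & \downarrow{\scriptstyle \phi_2} \\
(02, 102) & \xrightarrow{\;\zeta_{1,2}\;} & (02, 1\underline{00}02)
\end{array}$$
where the inserted elements are underlined.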
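The commuting diagram can be checked mechanically on this example. The following is a small sketch in which strings over $\mathbb{Z}_q$ are tuples of integers; the function names `phi`, `zeta`, and `tandem_dup` are illustrative stand-ins for $\phi_k$, $\zeta_{i,k}$, and $T^{\mathrm{tan}}_{i,k}$.

```python
def phi(x, k, q):
    """phi_k(x) = (Pref_k(x), Suff_{|x|-k}(x) - Pref_{|x|-k}(x)), computed mod q."""
    m = len(x) - k
    diff = tuple((x[k + j] - x[j]) % q for j in range(m))
    return tuple(x[:k]), diff


def tandem_dup(x, i, k):
    """T^tan_{i,k} on a tuple of symbols; identity when the rule does not apply."""
    if i + k > len(x):
        return tuple(x)
    return tuple(x[:i + k]) + tuple(x[i:i + k]) + tuple(x[i + k:])


def zeta(pair, i, k):
    """zeta_{i,k}: insert k zeros after the first i symbols of the y-part."""
    x, y = pair
    if i > len(y):
        return pair
    return x, y[:i] + (0,) * k + y[i:]


x = (0, 2, 1, 2, 3)  # the string 02123 over Z_4
q, k = 4, 2
print(phi(x, k, q))                    # the pair (02, 102)
print(phi(tandem_dup(x, 1, k), k, q))  # the pair (02, 10002)

# Check that the diagram commutes for every offset i, including out-of-range ones.
for i in range(len(x) + 2):
    assert phi(tandem_dup(x, i, k), k, q) == zeta(phi(x, k, q), i, k)
```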
Tandem duplication behaves differently
Theorem
For $S^{\mathrm{tan}}_k = S(\Sigma, s, \mathcal{T}^{\mathrm{tan}}_k)$, $|s| \geq k$, $\mathrm{cap}(S^{\mathrm{tan}}_k) = 0$.
Proof.
In the $\phi_k$-transform domain, $\phi_k(s) = (x, y)$, and tandem duplication becomes an insertion of $0^k$ in the $y$-part. Thus, a tandem-duplication operation is equivalent to throwing $k$ balls into a bin. There are at most $|y| + 1 = |s| - k + 1$ bins. Thus, after $t$ tandem-duplication operations, there are at most $\binom{|s|-k+t}{t} \leq (|s| - k + t)^{|s|-k}$ outcomes. Thus,
$$\mathrm{cap}(S^{\mathrm{tan}}_k) \leq \limsup_{t \to \infty} \frac{\log_2 \left( (|s| - k + t)^{|s|-k} \right)}{|s| + tk} = 0.$$
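The polynomial growth behind this bound is visible even for tiny seeds. The sketch below counts the distinct strings obtained after exactly t tandem duplications and prints the binomial bound next to each count; the seed `"0123"`, `k = 1`, and the helper name `tandem_level_sizes` are illustrative assumptions. For this particular seed the reachable strings are runs of the form 0…01…12…23…3, so the bound happens to be met with equality.

```python
from math import comb


def tandem_level_sizes(seed: str, k: int, steps: int) -> list[int]:
    """Number of distinct strings after exactly t = 0..steps tandem duplications."""
    level, sizes = {seed}, [1]
    for _ in range(steps):
        level = {x[:i + k] + x[i:i + k] + x[i + k:]
                 for x in level for i in range(len(x) - k + 1)}
        sizes.append(len(level))
    return sizes


seed, k = "0123", 1
sizes = tandem_level_sizes(seed, k, 6)
for t, size in enumerate(sizes):
    bound = comb(len(seed) - k + t, t)  # the binomial bound from the proof
    print(t, size, bound)
```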
Tandem duplication behaves differently
Corollary
$S^{\mathrm{tan}}_k$ systems are never fully expressive.
Proof.
If $\phi_k(s) = (x, y)$, then all possible mutations are limited (in the $\phi_k$-transform domain) to $(x, y')$ with $y'$ being the same as $y$ except for extra zeros. Thus, $\phi_k^{-1}(x, y1)$ can never be obtained from $s$.
Were we too strict?
Definition (Tandem Duplication)
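$$T^{\mathrm{tan}}_{i,k}(x) = \begin{cases} uvvw & \text{if } x = uvw,\ |u| = i,\ |v| = k, \\ x & \text{otherwise.} \end{cases}$$
$$\mathcal{T}^{\mathrm{tan}}_{\geq k} = \left\{ T^{\mathrm{tan}}_{i,k'} \;\middle|\; i \geq 0,\ k' \geq k \right\}.$$
The lower-bounded tandem-duplication system is defined as $S^{\mathrm{tan}}_{\geq k} = S(\Sigma, s, \mathcal{T}^{\mathrm{tan}}_{\geq k})$.
u v w ⇒ u v v w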
Yes, we were! Here’s full expressiveness:
Theorem
$S^{\mathrm{tan}}_{\geq k}$ is fully expressive.
Proof.
Employ a similar procedure to generate each substring as in the proof for $S^{\mathrm{end}}_k$, only each time copy a suffix of the string (from the chosen starting point, to the end).
What about full capacity?
Theorem
For any finite alphabet $\Sigma$, and $s \in \Sigma^*$, we have
$$\mathrm{cap}(S^{\mathrm{tan}}_{\geq 1}) \geq \log_2 (r + 1),$$
where $r$ is the largest (real) root of the polynomial
$$f(x) = x^{|\Sigma|} - \sum_{i=0}^{|\Sigma|-2} x^i.$$
Proof Strategy: Find a set $S \subseteq S^{\mathrm{tan}}_{\geq 1}$ for which we can calculate the capacity. But how?
Regular languages to the rescue
Definition (Recipe for a regular language)
• A finite alphabet Σ
• A finite directed labeled graph $G = (V, E, L)$, with $E \subseteq V \times V$ in the multiset sense, and $L : E \to \Sigma$.
• A starting state $s \in V$ and a set of accepting states $A \subseteq V$.
• If $e_1 e_2 \dots e_n$ is a directed path in $G$, it generates the word $L(e_1) L(e_2) \dots L(e_n)$.
• The language represented by $G$, denoted $S(G)$, is defined as the set of all words generated by directed paths starting at $s$ and ending in $A$.
A simple example for a regular language
Example: Consider the following directed labeled graph $G$:
[Figure: a directed graph with edges labeled 0, 1, and 0.]
$S(G)$ is the set of all binary strings in which every 1 is followed by a 0.
Graphs have properties
Definition
Let G = (V, E, L) be a graph generating a regular language.
• $G$ is irreducible if for every $v_1, v_2 \in V$, there is a directed path from $v_1$ to $v_2$.
• $G$ is primitive if it is irreducible and the gcd of all cycle lengths is 1.
• $G$ is lossless if for every $v_1, v_2 \in V$, and every word $w \in \Sigma^*$, there is at most one path from $v_1$ to $v_2$ that generates $w$.
Counting paths is easy
Definition
For $G = (V, E, L)$ define the adjacency matrix $A_G = (a_{u,v})$ as the $|V| \times |V|$ matrix where $a_{u,v}$ is the number of edges from $u$ to $v$ in $G$.
Observation
• The number of paths from $u$ to $v$ of length $n$ is exactly $(A_G^n)_{u,v}$.
• For a lossless graph $G$ with one accepting state, i.e., $A = \{v\}$, we have $|S(G) \cap \Sigma^n| = (A_G^n)_{s,v}$.
• Thus (with the above setting),
$$\mathrm{cap}(S(G)) = \limsup_{n \to \infty} \frac{\log_2 \left( (A_G^n)_{s,v} \right)}{n}.$$
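As a quick illustration of the observation, the sketch below counts words by raising an adjacency matrix to a power and compares the result with a brute-force count. The two-vertex graph is an assumed reading of the earlier example (vertex 0 carries a self-loop labeled 0 and an edge labeled 1 to vertex 1, which has an edge labeled 0 back to vertex 0, with vertex 0 both starting and accepting); the helper names are illustrative.

```python
from itertools import product
from math import log2


def mat_mul(a, b):
    """Product of two square integer matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(a, n):
    """n-th power of a square matrix by repeated multiplication."""
    result = [[int(i == j) for j in range(len(a))] for i in range(len(a))]
    for _ in range(n):
        result = mat_mul(result, a)
    return result


# Assumed adjacency matrix of the example graph (see the lead-in above).
A = [[1, 1],
     [1, 0]]


def brute_force(n):
    """Count binary words of length n in which every 1 is immediately followed by 0."""
    words = ("".join(bits) for bits in product("01", repeat=n))
    return sum(all(w[j + 1:j + 2] == "0" for j in range(n) if w[j] == "1")
               for w in words)


for n in range(1, 11):
    paths = mat_pow(A, n)[0][0]  # paths of length n from vertex 0 back to vertex 0
    print(n, paths, brute_force(n), round(log2(paths) / n, 3))
```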
Enter Perron and Frobenius
[Photographs: O. Perron and G. Frobenius. Source: Wikipedia.]
Theorem (Perron-Frobenius (Partial))
If $G$ is a primitive graph then:
1. $\lambda = \lambda(A_G) \triangleq \max \{ |\mu| : \mu \text{ is an eigenvalue of } A_G \}$, also called the spectral radius of $G$, is an eigenvalue of $A_G$.
2. There exist $y, x > 0$, unique (up to scalar multiplication) left and right eigenvectors for $\lambda$.
3. If $y \cdot x^T = 1$, then
$$\lim_{n \to \infty} \frac{1}{\lambda^n} A_G^n = x^T \cdot y.$$
Corollary
For a primitive lossless graph $G$, $\mathrm{cap}(S(G)) = \log_2 (\lambda(A_G))$.
Back to $S^{\mathrm{tan}}_{\geq 1}$
Proof.
Main Idea: Find a regular language that “resides” within $S^{\mathrm{tan}}_{\geq 1}$ and use its capacity to lower bound $\mathrm{cap}(S^{\mathrm{tan}}_{\geq 1})$.
Phase I: Denote the alphabet letters as $a_1, a_2, \dots, a_{|\Sigma|}$. As in the proof of full expressiveness, assume we reach a string with $a_{|\Sigma|} \dots a_2 a_1$ as a suffix. From now on, we ignore everything except this suffix.
Phase II: Run in iterations. In iteration $i$, where $i = |\Sigma|, |\Sigma| - 1, \dots, 3, 2$, use tandem duplication only on strings of the form $a_i a_{i-1} \dots a_1$. In the last iteration, tandem duplicate single letters.
It is easy to verify the resulting strings form the following regular language,
$$S = \left( a_{|\Sigma|}^+ \left( a_{|\Sigma|-1}^+ \left( \dots \left( a_2^+ (a_1^+)^+ \right)^+ \right)^+ \right)^+ \right)^+.$$
Proof by sub-language (Cont.)
Proof.
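$$S = \left( a_{|\Sigma|}^+ \left( a_{|\Sigma|-1}^+ \left( \dots \left( a_2^+ (a_1^+)^+ \right)^+ \right)^+ \right)^+ \right)^+.$$
[Figure: the labeled graph generating $S$, with edges labeled $a_{|\Sigma|}, a_{|\Sigma|-1}, \dots, a_2, a_1$.]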
Proof by sub-language (Cont.)
Proof.
The graph is lossless, irreducible, and primitive. Its adjacency matrix is
$$A_G = \begin{pmatrix}
1 & 1 & & & \\
 & 1 & 1 & & \\
 & & \ddots & \ddots & \\
 & & & 1 & 1 \\
1 & 1 & \cdots & 1 & 1
\end{pmatrix}.$$
Thus, the number of paths of length $n$ from the starting vertex to the accepting vertex grows exponentially as $\lambda^n$, where $\lambda$ is the spectral radius of the graph, i.e., the largest root of
$$\chi_{A_G}(\lambda) = \det(\lambda I - A_G) = (\lambda - 1)^{|\Sigma|} - \sum_{i=0}^{|\Sigma|-2} (\lambda - 1)^i.$$
Set $x = \lambda - 1$ and we obtain the result.
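The bound can be evaluated numerically for small alphabets. The sketch below finds the positive root $r$ of $f$ by bisection and, as a cross-check, estimates the spectral radius by power iteration. The matrix it builds (ones on the diagonal and superdiagonal, with an all-ones last row) is an assumed reading of $A_G$, and the function names are illustrative. Bisection on [1, 2] suffices because $f(1) \leq 0 < f(2)$ and $f$ has a single sign change in its coefficients, hence a single positive root.

```python
from math import log2


def f(x: float, q: int) -> float:
    """f(x) = x^q - sum_{i=0}^{q-2} x^i for an alphabet of size q."""
    return x**q - sum(x**i for i in range(q - 1))


def largest_root(q: int, iters: int = 200) -> float:
    """Bisection on [1, 2], which brackets the unique positive root for q >= 2."""
    lo, hi = 1.0, 2.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid, q) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def spectral_radius(q: int, iters: int = 500) -> float:
    """Power iteration on the assumed q-by-q adjacency matrix A_G."""
    a = [[1 if j in (i, i + 1) else 0 for j in range(q)] for i in range(q - 1)]
    a.append([1] * q)  # last row: all ones
    v, lam = [1.0] * q, 0.0
    for _ in range(iters):
        w = [sum(a[i][j] * v[j] for j in range(q)) for i in range(q)]
        lam = max(w)               # converges to the Perron eigenvalue
        v = [t / lam for t in w]   # renormalize so the largest entry is 1
    return lam


for q in range(2, 6):
    r = largest_root(q)
    print(q, round(r, 6), round(spectral_radius(q), 6), round(log2(r + 1), 6))
```

For the binary alphabet the root is $r = 1$, so the lower bound already equals the full capacity of 1 bit per symbol.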
What do we have so far?
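| Type | System | Capacity: Zero | Capacity: Partial | Capacity: Full | Full Expressiveness |
|---|---|---|---|---|---|
| End | $S^{\mathrm{end}}_k$ | − | − | ✓ | ✓ |
| End | $S^{\mathrm{end}}_{\geq k}$ | − | − | ✓ | ✓ |
| Tandem | $S^{\mathrm{tan}}_k$ | ✓ | − | − | − |
| Tandem | $S^{\mathrm{tan}}_{\geq k}$ | − | ? | ? | ✓ |

Open Question
Find $\mathrm{cap}(S^{\mathrm{tan}}_{\geq k})$ or improve the bounds on it.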
Full capacity ⇔ full expressiveness?
Theorem
Let $S$ be a string system over the alphabet $\Sigma$. If $S$ has full capacity then $S$ has full expressiveness.
Proof.
Assume to the contrary $S$ never contains $w \in \Sigma^k$ as a substring. Partition every word $x \in S$ into blocks of length $k$ (and perhaps a remainder block of length at most $k - 1$). Each block has at most $|\Sigma|^k - 1$ choices, since $w$ is forbidden. Thus,
$$|S \cap \Sigma^n| \leq (|\Sigma|^k - 1)^{\lfloor n/k \rfloor} \cdot |\Sigma|^{k-1}.$$
Then
$$\mathrm{cap}(S) \leq \frac{\log_2 (|\Sigma|^k - 1)}{k} < \log_2 |\Sigma|.$$
What about the other direction?
Example: Consider the following string system,
$$S = \{ vv \mid v \in \Sigma^* \}.$$
It is obvious that $S$ is fully expressive, but, since $|S \cap \Sigma^n| = |\Sigma|^{n/2}$ for even $n$,
$$\mathrm{cap}(S) = \frac{1}{2} \log_2 |\Sigma|.$$
Open Question
This example is not a string-duplication system. What is the connection between full capacity and full expressiveness for string-duplication systems?
A bit more on the big picture…
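| Type | System | Capacity: Zero | Capacity: Partial | Capacity: Full | Full Expressiveness |
|---|---|---|---|---|---|
| End | $S^{\mathrm{end}}_k$ | − | − | ✓ | ✓ |
| End | $S^{\mathrm{end}}_{\geq k}$ | − | − | ✓ | ✓ |
| Tandem | $S^{\mathrm{tan}}_k$ | ✓ | − | − | − |
| Tandem | $S^{\mathrm{tan}}_{\geq k}$ | − | ? | ? | ✓ |
| Palindromic | $S^{\mathrm{pal}}_k$ | − | ? | ? | ✓ |
| Interspersed | $S^{\mathrm{int}}_{k,k'}$ | ✓ | ✓ | ? | ✓ |

Open Question
Complete the missing pieces in this table.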
Let’s add probability to the mix
Why?
• Real biological processes are not always deterministic.
• Just like Shannon vs. Hamming: it is interesting!
Case study:
• Binary alphabet, $\Sigma = \{0, 1\}$, duplication length $k = 1$.
• The position to duplicate is chosen independently and uniformly.
• Two options:
  • $S^{\mathrm{tan}}_1$ – Tandem duplication: bit $b$ becomes $bb$.
  • $S^{\overline{\mathrm{tan}}}_1$ – Complement tandem duplication: bit $b$ becomes $b\bar{b}$.
Pólya String Models 45 / 79
Is this a Pólya urn model?
An urn contains $B$ black balls and $W$ white balls. At each step, a ball is extracted uniformly and independently from the urn. The ball is returned to the urn, together with another ball of the same color. The process repeats.
Crucial difference:
There is no string structure in a Pólya urn model.
How would we define capacity?
Let $S(i)$ denote the random variable whose value is the string after $i$ mutations, and $S(0) = s$ the seed string.
Definition
The probabilistic capacity of the process $S$ is defined as
$$\mathrm{cap}_{\mathrm{Prob}}(S) = \limsup_{n \to \infty} \frac{1}{n} H(S(n)),$$
where $H(S(n))$ is the entropy of $S(n)$, i.e.,
$$H(S(n)) = -\sum_{w \in \Sigma^*} \Pr(S(n) = w) \log_2 \Pr(S(n) = w).$$
The combinatorial capacity will be denoted by capComb.
Not everything is uniformly distributed
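Assume $S^{\overline{\mathrm{tan}}}_1$ (complement tandem duplication) with $S(0) = 0$. Each string is listed with the label shown beneath it in the tree:
n = 0: 0 (ε)
n = 1: 01 (1)
n = 2: 011 (21), 010 (12)
n = 3: from 011: 0111 (321), 0101 (231), 0110 (213); from 010: 0110 (312), 0100 (132), 0101 (123)
Thus, $\Pr(S(3) = 0110) = \frac{1}{3}$ but $\Pr(S(3) = 0111) = \frac{1}{6}$.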
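These probabilities can be reproduced exactly by propagating the distribution one mutation at a time. The sketch below does this for complement tandem duplication with exact fractions and also evaluates the entropy $H(S(n))$ used in the definition of probabilistic capacity; the function names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from math import log2


def comp_tandem_distribution(seed: str, n: int) -> dict[str, Fraction]:
    """Exact law of S(n): each step picks a uniform position and turns bit b into b, not-b."""
    flip = {"0": "1", "1": "0"}
    dist = {seed: Fraction(1)}
    for _ in range(n):
        nxt = Counter()
        for x, p in dist.items():
            for i, b in enumerate(x):
                # insert the complement of bit i right after position i
                nxt[x[:i + 1] + flip[b] + x[i + 1:]] += p / len(x)
        dist = dict(nxt)
    return dist


def entropy(dist: dict[str, Fraction]) -> float:
    """Shannon entropy in bits."""
    return -sum(float(p) * log2(float(p)) for p in dist.values())


dist3 = comp_tandem_distribution("0", 3)
for w in sorted(dist3):
    print(w, dist3[w])
print(entropy(dist3) / 3)  # empirical value of H(S(n)) / n at n = 3
```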
One simple connection exists
Lemma
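For $S \in \left\{ S^{\mathrm{tan}}_1, S^{\overline{\mathrm{tan}}}_1 \right\}$, $\mathrm{cap}_{\mathrm{Prob}}(S) \leq \mathrm{cap}_{\mathrm{Comb}}(S)$.
Proof.
$H(S(n))$ is maximized when $S(n)$ is uniformly distributed,
$$H(S(n)) \leq \log_2 \left| S \cap \Sigma^{|S(0)|+n} \right|.$$
Thus,
$$\mathrm{cap}_{\mathrm{Prob}}(S) = \limsup_{n \to \infty} \frac{1}{n} H(S(n)) \leq \limsup_{n \to \infty} \frac{1}{n} \log_2 \left| S \cap \Sigma^{|S(0)|+n} \right| = \mathrm{cap}_{\mathrm{Comb}}(S).$$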
So for tandem duplication…
Corollary
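For any $S(0)$ we have
$$\mathrm{cap}_{\mathrm{Prob}}(S^{\mathrm{tan}}_1) = 0.$$
Proof.
We obviously have $\mathrm{cap}_{\mathrm{Prob}}(S^{\mathrm{tan}}_1) \geq 0$. Additionally,
$$\mathrm{cap}_{\mathrm{Prob}}(S^{\mathrm{tan}}_1) \leq \mathrm{cap}_{\mathrm{Comb}}(S^{\mathrm{tan}}_1) = 0,$$
which we already proved.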
Complement-tandem duplication is harder
Assume $S(0) = 0$ for simplicity. Let us record the history of mutations in a string, whose $i$th position equals $j$ if the $j$th mutation caused the $i$th symbol.
Example:
0 → 01 → 010 → 0110,
ε → 1 → 12 → 312.
Observation
1 History is a permutation.
2 Each permutation is equally likely.
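The observation can be checked for small $n$ by carrying the history along with the string. This is a sketch only: the helper name `histories` and the tuple representation of a history are assumptions made for illustration.

```python
from itertools import permutations

def histories(n):
    """All (string, history) pairs after n complement tandem duplications of '0'."""
    flip = {"0": "1", "1": "0"}
    states = [("0", ())]  # the initial symbol has no history entry
    for step in range(1, n + 1):
        nxt = []
        for s, h in states:
            for i in range(len(s)):
                # The new symbol lands at string index i + 1, i.e. history index i.
                child = s[:i + 1] + flip[s[i]] + s[i + 1:]
                nxt.append((child, h[:i] + (step,) + h[i:]))
        states = nxt
    return states

for string, history in histories(3):
    print(string, "".join(map(str, history)))

# Every permutation of {1, ..., n} appears exactly once as a history.
each_once = sorted(h for _, h in histories(4)) == sorted(permutations(range(1, 5)))
print(each_once)
```

Since there is exactly one mutation path per permutation, and every path of length $n$ has the same probability under a uniform choice of position, each history is equally likely.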
Pólya String Models 51 / 79
Here it is again
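Assume $S^{\overline{\mathrm{tan}}}_1$ with $S(0) = 0$. Each mutated string is listed with its history in parentheses:
n = 0: 0 (ε)
n = 1: 01 (1)
n = 2: 011 (21), 010 (12)
n = 3, from 011: 0111 (321), 0101 (231), 0110 (213)
n = 3, from 010: 0110 (312), 0100 (132), 0101 (123)
Some histories result in the same mutated string.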
Pólya String Models 52 / 79
It’s all in the signature
Definition
The signature of a permutation $\pi \in S_n$ is a binary string $w = w_1 w_2 \dots w_{n-1}$, where
$$w_i = \begin{cases} 0 & \pi(i) > \pi(i+1), \\ 1 & \pi(i) < \pi(i+1). \end{cases}$$
Theorem
Consider $S^{\overline{\mathrm{tan}}}_1$ with $S(0) = 0$. Then $\Pr(S(n) = 01w)$ is the same as the probability of getting the signature $w$ when choosing a permutation from $S_n$ uniformly.
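The theorem can be sanity-checked for small $n$ by counting. The sketch below assumes a uniformly chosen position per step, so that every mutation path of length $n$ has probability $1/n!$ and it suffices to compare path counts with signature counts. All function names are illustrative.

```python
from collections import Counter
from itertools import permutations

def signature(perm):
    """Binary string with 1 at ascents and 0 at descents of perm."""
    return "".join("1" if a < b else "0" for a, b in zip(perm, perm[1:]))

def signature_counts(n):
    """Number of permutations in S_n having each signature."""
    return Counter(signature(p) for p in permutations(range(1, n + 1)))

def path_counts(n):
    """Number of mutation paths from '0' to each string after n steps."""
    flip = {"0": "1", "1": "0"}
    counts = Counter({"0": 1})
    for _ in range(n):
        nxt = Counter()
        for s, c in counts.items():
            for i in range(len(s)):
                nxt[s[:i + 1] + flip[s[i]] + s[i + 1:]] += c
        counts = nxt
    return counts

def theorem_holds(n):
    """Compare the path count of 01w with the signature count of w (both out of n!)."""
    paths = path_counts(n)
    if not all(s.startswith("01") for s in paths):
        return False
    return {s[2:]: c for s, c in paths.items()} == dict(signature_counts(n))

print(all(theorem_holds(n) for n in range(1, 7)))
```

This is a check on small cases, not a proof; the proof on the following slides proceeds through a shared recursion.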
Pólya String Models 53 / 79
It’s all in the signature – Proof
Proof.
Assuming $w \in \{0, 1\}^{n-1}$, some notation first:
1. $\Pi_{01w}$: the set of history permutations that lead to a mutated string $01w$.
2. $\Psi_w$: the set of permutations from $S_n$ with signature $w$.
3. For any string $v \in \{0, 1\}^{\ell}$, the set of positions where 0 is preceded by a 1 (including possible edges):
$$T_v = \{ i \in [\ell + 1] : (v_{i-1} = 1 \text{ or } i = 1) \text{ and } (v_i = 0 \text{ or } i = \ell + 1) \}.$$
Example: for $v = 0011010$ we have $T_v = \{1, 5, 7\}$.
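As a quick illustration of the definition of $T_v$, here is a small sketch that evaluates it directly with 1-based positions; the function name `t_set` is made up for this example.

```python
def t_set(v):
    """T_v for a 0/1 string v, using 1-based positions as in the definition."""
    ell = len(v)
    result = set()
    for i in range(1, ell + 2):
        left_ok = i == 1 or v[i - 2] == "1"          # v_{i-1} = 1 or i = 1
        right_ok = i == ell + 1 or v[i - 1] == "0"   # v_i = 0 or i = ell + 1
        if left_ok and right_ok:
            result.add(i)
    return result

print(sorted(t_set("0011010")))  # → [1, 5, 7]
```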
Pólya String Models 54 / 79
It’s all in the signature – Proof (Cont.)
Proof.
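Strategy: Prove $|\Pi_{01w}| = |\Psi_w|$ by showing both expressions have the same recursion with the same starting conditions.
Starting conditions: Trivially $|\Pi_{01\varepsilon}| = |\Psi_{\varepsilon}| = 1$.
Recursion for $\Psi_w$: Given $w \in \{0, 1\}^{n-1}$, we can recursively construct a permutation $\pi \in S_n$ with signature $w$ by picking $\pi^{-1}(n)$, which can only be some $i \in T_w$. We then recursively construct two permutations, with signatures $w_{1 \dots i-2}$ and $w_{i+1 \dots n-1}$; the binomial coefficient counts which $i-1$ of the remaining $n-1$ values are placed to the left of position $i$. Thus,
$$|\Psi_w| = \sum_{i \in T_w} \binom{n-1}{i-1} \left| \Psi_{w_{1 \dots i-2}} \right| \left| \Psi_{w_{i+1 \dots n-1}} \right|.$$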
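The recursion for $|\Psi_w|$ can be compared against a direct enumeration of $S_n$. This is a sketch under one stated assumption: an empty side (when $i = 1$ or $i = n$) contributes a factor of 1. The name `psi` is illustrative.

```python
from collections import Counter
from functools import lru_cache
from itertools import permutations
from math import comb

@lru_cache(maxsize=None)
def psi(w):
    """|Psi_w| via the recursion above; w is a 0/1 string of length n - 1."""
    n = len(w) + 1
    if n <= 1:
        return 1
    total = 0
    for i in range(1, n + 1):
        left_ok = i == 1 or w[i - 2] == "1"    # w_{i-1} = 1 or i = 1
        right_ok = i == n or w[i - 1] == "0"   # w_i = 0 or i = n
        if left_ok and right_ok:
            left = w[:i - 2] if i >= 2 else ""  # w_1 ... w_{i-2}
            right = w[i:]                       # w_{i+1} ... w_{n-1}
            total += comb(n - 1, i - 1) * psi(left) * psi(right)
    return total

def brute_force_counts(n):
    """Signature counts obtained by listing all of S_n."""
    sig = lambda p: "".join("1" if a < b else "0" for a, b in zip(p, p[1:]))
    return Counter(sig(p) for p in permutations(range(n)))

agree = all(psi(w) == c for n in range(1, 7) for w, c in brute_force_counts(n).items())
print(agree)
```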
Pólya String Models 55 / 79
It’s all in the signature – Proof (Cont.)
Proof.
Recursion for $\Pi_{01w}$: Given $w \in \{0, 1\}^{n-1}$, consider a history permutation $\pi \in S_n$ resulting in the mutated sequence $01w$. Obviously $\pi^{-1}(1)$ is a position of a bit 1 in $01w$ which is last in a run, i.e., followed by a 0 or last in the string. Thus, pick $i = \pi^{-1}(1) \in T_w$, and construct the rest of the permutation recursively using $w_{1 \dots i-2}$ and $w_{i+1 \dots n-1}$. Thus,
$$|\Pi_{01w}| = \sum_{i \in T_w} \binom{n-1}{i-1} \left| \Pi_{01w_{1 \dots i-2}} \right| \left| \Pi_{10w_{i+1 \dots n-1}} \right|.$$
Pólya String Models 56 / 79
Last time I’m showing this slide
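Assume $S^{\overline{\mathrm{tan}}}_1$ with $S(0) = 0$. Each mutated string is listed with its history in parentheses:
n = 0: 0 (ε)
n = 1: 01 (1)
n = 2: 011 (21), 010 (12)
n = 3, from 011: 0111 (321), 0101 (231), 0110 (213)
n = 3, from 010: 0110 (312), 0100 (132), 0101 (123)
Open Question
Find a nice bijection between $\Pi_{01w}$ and $\Psi_w$.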
Pólya String Models 57 / 79
And now, the capacity
Theorem
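For $S^{\overline{\mathrm{tan}}}_1$ with $S(0) = 0$,
$$0.7213 \approx \frac{\log_2(e)}{2} \le \mathrm{cap}_{\mathrm{Prob}}(S^{\overline{\mathrm{tan}}}_1) \le H_2\left(\frac{1}{3}\right) \approx 0.9183,$$
where $H_2(x) \triangleq -x \log_2(x) - (1 - x) \log_2(1 - x)$ is the binary entropy function.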
Pólya String Models 58 / 79
Proof of the bounds
Proof.
Consider the real random variables $X_1, X_2, \dots$, chosen i.i.d., uniformly from $[0, 1]$. Sorting $X_1^n \triangleq X_1, X_2, \dots, X_n$ generates a random permutation (due to symmetry, chosen uniformly from $S_n$).
Define
$$Q_i \triangleq \begin{cases} 1 & X_i < X_{i+1}, \\ 0 & X_i > X_{i+1}, \end{cases}$$
(except for a 0-measure undefined set). So $Q_1^{n-1}$ is a signature of a uniformly chosen random permutation from $S_n$.
Pólya String Models 59 / 79
Proof of the bounds (Cont.)
Proof.
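We now have
$$\Pr(S(n) = 01w) = \Pr(Q_1^{n-1} = w),$$
and
$$\mathrm{cap}_{\mathrm{Prob}}(S^{\overline{\mathrm{tan}}}_1) = \limsup_{n\to\infty} \frac{1}{n} H(S(n)) = \limsup_{n\to\infty} \frac{1}{n} H(Q_1^{n-1}) = \limsup_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n-1} H(Q_i \mid Q_1^{i-1}).$$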
Pólya String Models 60 / 79
Proof of the bounds (Cont.)
Proof.
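Lower bound: Since $Q_1^{i-1} \to X_i \to Q_i$ forms a Markov chain, we have
$$H(Q_i \mid Q_1^{i-1}) \ge H(Q_i \mid X_i).$$
Furthermore, $\Pr(Q_i = 0 \mid X_i = x) = x$. Thus,
$$\mathrm{cap}_{\mathrm{Prob}}(S^{\overline{\mathrm{tan}}}_1) = \limsup_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n-1} H(Q_i \mid Q_1^{i-1}) \ge \limsup_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n-1} H(Q_i \mid X_i) = H(Q_1 \mid X_1) = \int_0^1 H_2(x)\,dx = \frac{\log_2(e)}{2}.$$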
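The closed form of the integral can be checked numerically. The sketch below uses a plain midpoint rule, chosen here only for simplicity; the step count is an arbitrary assumption.

```python
from math import e, log2

def h2(x):
    """Binary entropy function, with H2(0) = H2(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * log2(x) - (1 - x) * log2(1 - x)

def midpoint_integral(f, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return width * sum(f(a + (j + 0.5) * width) for j in range(steps))

approx = midpoint_integral(h2, 0.0, 1.0)
exact = log2(e) / 2
print(f"{approx:.6f} {exact:.6f}")
```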
Pólya String Models 61 / 79
Proof of the bounds (Cont.)
Proof.
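Upper bound:
$$\mathrm{cap}_{\mathrm{Prob}}(S^{\overline{\mathrm{tan}}}_1) = \limsup_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n-1} H(Q_i \mid Q_1^{i-1}) \le \limsup_{n\to\infty} \frac{1}{n} \sum_{i=2}^{n-1} H(Q_i \mid Q_{i-1})$$
$$= H(Q_2 \mid Q_1) = \frac{1}{2} \left( H(Q_2 \mid Q_1 = 0) + H(Q_2 \mid Q_1 = 1) \right) = H_2\left(\frac{1}{3}\right),$$
since
$$\Pr(Q_2 = 0 \mid Q_1 = 0) = \frac{\int_0^1 dx_1 \int_0^{x_1} dx_2 \int_0^{x_2} dx_3}{\int_0^1 dx_1 \int_0^{x_1} dx_2} = \frac{1/6}{1/2} = \frac{1}{3},$$
and similarly for $\Pr(Q_2 = 1 \mid Q_1 = 1)$.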
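The value $1/3$ can also be obtained without integrals: for i.i.d. continuous variables all $3!$ relative orders of $X_1, X_2, X_3$ are equally likely, so it is enough to count orders. A minimal sketch of that counting argument (not the computation on the slide):

```python
from fractions import Fraction
from itertools import permutations
from math import log2

def h2(x):
    """Binary entropy function for 0 < x < 1."""
    return -x * log2(x) - (1 - x) * log2(1 - x)

# Each tuple gives the ranks of (X1, X2, X3); all six are equally likely.
orders = list(permutations(range(3)))
q1_zero = [o for o in orders if o[0] > o[1]]       # Q1 = 0 means X1 > X2
q1_q2_zero = [o for o in q1_zero if o[1] > o[2]]   # additionally Q2 = 0

p = Fraction(len(q1_q2_zero), len(q1_zero))
upper_bound = h2(float(p))
print(p, f"{upper_bound:.4f}")
```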
Pólya String Models 62 / 79
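Probabilistic ≠ Combinatorial
Observation
$$\mathrm{cap}_{\mathrm{Prob}}(S^{\overline{\mathrm{tan}}}_1) \le H_2\left(\frac{1}{3}\right) < 1 = \mathrm{cap}_{\mathrm{Comb}}(S^{\overline{\mathrm{tan}}}_1).$$
Open Questions
1. Find $\mathrm{cap}_{\mathrm{Prob}}(S^{\overline{\mathrm{tan}}}_1)$.
2. We know nothing for duplication length $\ge 2$.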
Pólya String Models 63 / 79
Moving on to error correction
An error-correcting code has two main components:
1. An error ball: Its size and shape depend on the kind of errors the channel induces.
2. A packing of error balls: Its density affects communication efficiency. Its structure affects ease of encoding/decoding.
Error-Correcting Codes 64 / 79
Let us recall the scenario
• Information is stored in the DNA of some bacteria.
• The bacteria mutate over time.
• When the information is read, the DNA has gone through a (perhaps unbounded) number of duplications.
Goal
Protect information against duplication errors!
Case study
We focus on $S^{\mathrm{tan}}_k$: tandem duplication with fixed duplication length $k$.
Error-Correcting Codes 65 / 79
Some definitions are required
Definition
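If $v \in S^{\mathrm{tan}}_k(\Sigma, u, T^{\mathrm{tan}}_k)$, we denote it as $u \overset{*}{\Longrightarrow}_k v$. We say $u$ is an ancestor of $v$, and $v$ is a descendant of $u$. We define the descendant cone of $u$ as
$$D^*_k(u) = \left\{ v \in \Sigma^* : u \overset{*}{\Longrightarrow}_k v \right\},$$
and the ancestor cone as
$$A^*_k(u) = \left\{ v \in \Sigma^* : v \overset{*}{\Longrightarrow}_k u \right\}.$$
[Figure: the ancestor cone $A^*_k(u)$ and the descendant cone $D^*_k(u)$ of $u$, drawn along a time axis.]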
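A descendant cone is infinite, but its short members can be listed by a breadth-first search. The sketch below assumes the usual form of the duplication rule, $T^{\mathrm{tan}}_{i,k}(uvw) = uvvw$ with $|u| = i$ and $|v| = k$, leaving the string unchanged when it is too short; this matches the example $02123 \to 0212123$ given later in the talk. The function names and the length cap are illustrative.

```python
def tandem_dup(x, i, k):
    """T_{i,k}(x) = u v v w for x = u v w, |u| = i, |v| = k; x itself if too short."""
    if i + k > len(x):
        return x
    return x[:i + k] + x[i:i + k] + x[i + k:]

def descendants(u, k, max_len):
    """Members of D*_k(u) of length at most max_len (the full cone is infinite)."""
    seen = {u}
    frontier = [u]
    while frontier:
        nxt = []
        for s in frontier:
            if len(s) + k > max_len:
                continue  # any child would exceed the length cap
            for i in range(len(s) - k + 1):
                child = tandem_dup(s, i, k)
                if child not in seen:
                    seen.add(child)
                    nxt.append(child)
        frontier = nxt
    return seen

print(tandem_dup("02123", 1, 2))
print(sorted(descendants("01", 1, 4)))
```

Under this definition two words can share a code only if such cones never meet, which is what the code definition on the next slide requires.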
Error-Correcting Codes 66 / 79
Now we define a code
Definition
An $(n, M; *)_k$ code $C$ is a subset $C \subseteq \Sigma^n$ of size $|C| = M$, such that for each $u, v \in C$, $u \ne v$,
$$D^*_k(u) \cap D^*_k(v) = \emptyset.$$
The decoding problem
Given an $(n, M; *)_k$ code $C$, and a (mutated) word $v \in \Sigma^*$, find
$$\mathrm{Decode}(v) = A^*_k(v) \cap C.$$
Error-Correcting Codes 67 / 79
Reminder – The φk-transform
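We assume WLOG that $\Sigma = \mathbb{Z}_q$.
Definition
We define the transform $\phi_k : \mathbb{Z}_q^{\ge k} \to \mathbb{Z}_q^k \times \mathbb{Z}_q^*$ by
$$\phi_k(x) = \left( \mathrm{Pref}_k(x),\ \mathrm{Suff}_{|x|-k}(x) - \mathrm{Pref}_{|x|-k}(x) \right),$$
as well as $\zeta_{i,k} : \mathbb{Z}_q^k \times \mathbb{Z}_q^* \to \mathbb{Z}_q^k \times \mathbb{Z}_q^*$,
$$\zeta_{i,k}(x, y) = \begin{cases} (x, u 0^k w) & \text{if } y = uw,\ |u| = i, \\ (x, y) & \text{otherwise,} \end{cases}$$
where $\mathrm{Pref}_i(x)$ and $\mathrm{Suff}_i(x)$ are, respectively, the $i$-prefix and $i$-suffix of $x$.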
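A short sketch of $\phi_k$ and its inverse may help make the definition concrete. Words are represented as lists of integers in $\mathbb{Z}_q$, which is a representation chosen for this example; the inverse simply undoes the componentwise difference, since $y_j = x_{j+k} - x_j \pmod q$.

```python
def phi(x, k, q):
    """phi_k(x) = (Pref_k(x), Suff_{|x|-k}(x) - Pref_{|x|-k}(x)) over Z_q."""
    if len(x) < k:
        raise ValueError("phi_k is only defined for words of length at least k")
    diff = [(x[j + k] - x[j]) % q for j in range(len(x) - k)]
    return list(x[:k]), diff

def phi_inverse(prefix, y, q):
    """Rebuild x from (prefix, y) using x[j + k] = x[j] + y[j] (mod q)."""
    x = list(prefix)
    for j, d in enumerate(y):
        x.append((x[j] + d) % q)
    return x

prefix, y = phi([0, 2, 1, 2, 3], k=2, q=4)
print(prefix, y)
print(phi_inverse(prefix, y, q=4))
```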
Error-Correcting Codes 68 / 79
Main tool - φk-transform domain
Lemma
The following diagram commutes:
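$$\begin{array}{ccc} \mathbb{Z}_q^{\ge k} & \xrightarrow{\ T^{\mathrm{tan}}_{i,k}\ } & \mathbb{Z}_q^{\ge k} \\ \downarrow \phi_k & & \downarrow \phi_k \\ \mathbb{Z}_q^k \times \mathbb{Z}_q^* & \xrightarrow{\ \zeta_{i,k}\ } & \mathbb{Z}_q^k \times \mathbb{Z}_q^* \end{array}$$
i.e., for every string $x \in \mathbb{Z}_q^{\ge k}$,
$$\phi_k(T^{\mathrm{tan}}_{i,k}(x)) = \zeta_{i,k}(\phi_k(x)).$$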
Error-Correcting Codes 69 / 79
Main tool - φk-transform domain
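Example
Assume $\Sigma = \mathbb{Z}_4$. Starting with $02123$ and letting $i = 1$ and $k = 2$ leads to
$$\begin{array}{ccc} 02123 & \xrightarrow{\ T^{\mathrm{tan}}_{1,2}\ } & 0212123 \\ \downarrow \phi_2 & & \downarrow \phi_2 \\ (02, 102) & \xrightarrow{\ \zeta_{1,2}\ } & (02, 10002) \end{array}$$
where the inserted elements (the duplicated copy of $21$ in the top row and the run $00$ in the bottom row) are underlined on the slide.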
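The commuting diagram can be checked exhaustively on short words. The sketch below represents words as tuples over $\mathbb{Z}_q$ and assumes $T^{\mathrm{tan}}_{i,k}$ leaves a word unchanged when $i + k$ exceeds its length, mirroring the "otherwise" branch of $\zeta_{i,k}$; helper names are illustrative.

```python
from itertools import product

def tandem_dup(x, i, k):
    """T_{i,k}(x) = u v v w for x = u v w, |u| = i, |v| = k; x itself if too short."""
    if i + k > len(x):
        return x
    return x[:i + k] + x[i:i + k] + x[i + k:]

def phi(x, k, q):
    """phi_k on a tuple x over Z_q."""
    return x[:k], tuple((x[j + k] - x[j]) % q for j in range(len(x) - k))

def zeta(pair, i, k):
    """zeta_{i,k}: insert k zeros after the first i symbols of y, if possible."""
    x, y = pair
    if i > len(y):
        return pair
    return x, y[:i] + (0,) * k + y[i:]

def diagram_commutes(q, k, max_len):
    """Check phi(T(x)) == zeta(phi(x)) for all words of length k..max_len and all i."""
    for n in range(k, max_len + 1):
        for x in product(range(q), repeat=n):
            for i in range(n + 1):
                if phi(tandem_dup(x, i, k), k, q) != zeta(phi(x, k, q), i, k):
                    return False
    return True

x = (0, 2, 1, 2, 3)
print(phi(tandem_dup(x, 1, 2), 2, 4), zeta(phi(x, 2, 4), 1, 2))
print(diagram_commutes(q=3, k=2, max_len=5))
```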
Error-Correcting Codes 70 / 79
The ancestors are the key component
[Figure: the ancestor cone $A^*_k(u)$, the descendant cone $D^*_k(u)$ and the root $R_k(u)$ of $u$, drawn along a time axis.]
Definition
If $A^*_k(v) = \{v\}$ we say $v$ is irreducible. The set of irreducible words is denoted $\mathrm{Irr}_k$. The roots of $v \in \Sigma^*$ are defined by $R_k(v) = A^*_k(v) \cap \mathrm{Irr}_k$.
Lemma
For tandem duplication of length $k$, and every $v \in \Sigma^*$, $|R_k(v)| = 1$.
Already proved by Leupold et al. (2005). We give a different proof, using $\phi_k$, enabling a code construction.
Error-Correcting Codes 71 / 79
Proof of root uniqueness
Proof.
Denote $\phi_k(v) = (x, y)$, and $y = 0^{m_0} y_1 0^{m_1} y_2 0^{m_2} \dots 0^{m_{t-1}} y_t 0^{m_t}$, where $y_i \ne 0$ for all $i$. Any ancestor $v' \in A^*_k(v)$ must be of the form
$$\phi_k(v') = (x, 0^{m_0 - i_0 k} y_1 0^{m_1 - i_1 k} y_2 0^{m_2 - i_2 k} \dots 0^{m_{t-1} - i_{t-1} k} y_t 0^{m_t - i_t k}),$$
and it is irreducible if and only if
$$\phi_k(v') = (x, 0^{m_0 \bmod k} y_1 0^{m_1 \bmod k} y_2 0^{m_2 \bmod k} \dots 0^{m_{t-1} \bmod k} y_t 0^{m_t \bmod k}),$$
giving a unique root.
Error-Correcting Codes 72 / 79
Disjoint descendant cones are simple
Corollary
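$D^*_k(u) \cap D^*_k(v) \ne \emptyset$ if and only if $R_k(u) = R_k(v)$.
Proof.
⇒: If $w \in D^*_k(u) \cap D^*_k(v)$ then
$$R_k(u) \overset{*}{\Longrightarrow}_k u \overset{*}{\Longrightarrow}_k w \quad \text{and} \quad R_k(v) \overset{*}{\Longrightarrow}_k v \overset{*}{\Longrightarrow}_k w,$$
and since the root of $w$ is unique, $R_k(u) = R_k(v)$.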
Error-Correcting Codes 73 / 79
Disjoint cones proof (Cont.)
Proof.
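⇐: If $R_k(u) = R_k(v)$ then denote
$$\phi_k(R_k(u)) = \phi_k(R_k(v)) = (x, 0^{m_0} y_1 0^{m_1} y_2 0^{m_2} \dots 0^{m_{t-1}} y_t 0^{m_t}).$$
Then,
$$\phi_k(u) = (x, 0^{m'_0} y_1 0^{m'_1} y_2 0^{m'_2} \dots 0^{m'_{t-1}} y_t 0^{m'_t}),$$
$$\phi_k(v) = (x, 0^{m''_0} y_1 0^{m''_1} y_2 0^{m''_2} \dots 0^{m''_{t-1}} y_t 0^{m''_t}).$$
Define $w \in \Sigma^*$ such that
$$\phi_k(w) = (x, 0^{\max(m'_0, m''_0)} y_1 0^{\max(m'_1, m''_1)} \dots 0^{\max(m'_{t-1}, m''_{t-1})} y_t 0^{\max(m'_t, m''_t)}),$$
which immediately shows $u \overset{*}{\Longrightarrow}_k w$ and $v \overset{*}{\Longrightarrow}_k w$.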
Error-Correcting Codes 74 / 79
Putting it all together
Theorem
• $v \in \mathrm{Irr}_k$ iff $\phi_k(v) = (x, y)$ and $y$ is $(0, k-1)$-RLL.
• $\mathrm{Irr}_k \cap \Sigma^n$ is an $(n, M; *)_k$-code.
• Decoding $v \in \Sigma^*$ may be done in linear time by:
1. Finding $\phi_k(v) = (x, y)$.
2. Reducing runs of 0's in $y$ modulo $k$ to obtain $y'$.
3. Returning the answer $\phi_k^{-1}(x, y')$.
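The three decoding steps can be sketched as follows. Words are lists of integers in $\mathbb{Z}_q$; the function names, and the choice to return words shorter than $k$ unchanged (no duplication of length $k$ can have produced them), are assumptions of this sketch rather than details from the talk.

```python
from itertools import groupby

def phi(x, k, q):
    """phi_k(x): the k-prefix and the k-step differences of x over Z_q."""
    return list(x[:k]), [(x[j + k] - x[j]) % q for j in range(len(x) - k)]

def phi_inverse(prefix, y, q):
    """Rebuild x from (prefix, y) using x[j + k] = x[j] + y[j] (mod q)."""
    x = list(prefix)
    for j, d in enumerate(y):
        x.append((x[j] + d) % q)
    return x

def reduce_zero_runs(y, k):
    """Shorten every maximal run of zeros in y to its length modulo k."""
    out = []
    for value, run in groupby(y):
        run = list(run)
        out.extend(run[:len(run) % k] if value == 0 else run)
    return out

def decode(v, k, q):
    """Root of v under tandem duplication of length k, via the three steps above."""
    if len(v) < k:
        return list(v)
    prefix, y = phi(v, k, q)
    return phi_inverse(prefix, reduce_zero_runs(y, k), q)

print(decode([0, 2, 1, 2, 1, 2, 3], k=2, q=4))
print(decode([0, 2, 0, 2, 1, 2, 1, 2, 3], k=2, q=4))
```

Each step is a single pass over the word, which is consistent with the linear-time claim. When the code is $\mathrm{Irr}_k \cap \Sigma^n$, the returned root is the transmitted codeword.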
Observation
The code may be further enlarged (and made optimal!) by carefully adding shorter RLL sequences.
Error-Correcting Codes 75 / 79
Other results
• $S^{\mathrm{tan}}_{\le 3}$: forms a regular language, unique root, positive (though not full) capacity, not fully expressive.³
• Unique root exists in several other cases, enabling code construction and decoding.³
Theorem
Let $\Sigma \ne \emptyset$ be an alphabet, and $U \subseteq \mathbb{N}$, $U \ne \emptyset$, a set of tandem-duplication lengths. Denote $k = \min(U)$. Then $(\Sigma, U)$ is a unique-root pair if and only if it matches one of the following cases:
• $|\Sigma| = 1$: $U \subseteq k\mathbb{N}$.
• $|\Sigma| = 2$: $U = \{k\}$ or $U \supseteq \{1, 2\}$.
• $|\Sigma| \ge 3$: $U = \{k\}$, $U = \{1, 2\}$, or $U = \{1, 2, 3\}$.
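The case analysis of the theorem translates directly into a predicate. The sketch below only encodes the table for a finite set $U$; it does not test uniqueness of roots itself, and the function name is made up for illustration.

```python
def is_unique_root_pair(alphabet_size, lengths):
    """Evaluate the case table of the theorem for |Sigma| and a finite set U."""
    U = set(lengths)
    if alphabet_size < 1 or not U or min(U) < 1:
        raise ValueError("need a non-empty alphabet and a non-empty set of positive lengths")
    k = min(U)
    if alphabet_size == 1:
        return all(u % k == 0 for u in U)      # U is a subset of kN
    if alphabet_size == 2:
        return U == {k} or {1, 2} <= U          # U = {k} or U contains {1, 2}
    return U in ({k}, {1, 2}, {1, 2, 3})        # |Sigma| >= 3

print(is_unique_root_pair(4, {1, 2, 3}))
print(is_unique_root_pair(4, {1, 2, 3, 4}))
```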
³ Jain et al., IEEE Trans. on Inform. Th. 2017.
Conclusion 76 / 79
Other results
• What is the longest duplication distance to the root (in unbounded tandem duplication)? Apparently for length $n$ sequences it is $\Theta(n)$ in the worst (and common!) case.⁴
• In the probabilistic models we also know the capacity of end duplication, as well as of a mix of duplication and complement duplication, but only for duplication length $k = 1$.⁵
• Tandem duplication with point-mutation (substitution) has more capacity and expressiveness, but requires more care when constructing error-correcting codes.⁶
⁴ Alon et al., ISIT 2016.
⁵ Elishco et al., ISIT 2016.
⁶ Jain et al., ISIT 2017.
Conclusion 77 / 79
Many open questions remain!
Open Questions
• Study error-correcting codes for duplication models other than tandem duplication.
• Find error-correcting codes for a probabilistic channel, correcting typical errors.
• Study a mix of duplication and other mutations (substitutions, insertions/deletions).
• Study error models which are context sensitive.
• For the biologists: Find out the channel parameters in the real world.
Conclusion 78 / 79
Thank You
Conclusion 79 / 79