Classification of Markov Sources Through Joint String Complexity:
Theory and Experiments
Philippe Jacquet, Dimitrios Milioris,
Bell Labs,
Wojciech Szpankowski,
Purdue U
How to compare
• Two DNA sequences?
• Two Tweets?
– Use the joint complexity!
"allow users to download an entire movie in one second." I need this http://t.co/3fbNfKEkah
Green energy boss accuses Govt of obstructing renewable energy development http://t.co/v5Lq2Jx1GQ
Joint complexity
• Definition:
– Two sequences X, Y
– J(X,Y) is the number of common factors between X and Y
• Factors contain consecutive symbols
• X=banana, Y=ananas: J(X,Y)=10: ε (the empty factor), a, n, an, na, ana, nan, anan, nana, anana
• The larger J(X,Y), the closer X and Y are
– J(DNA of a cat, DNA of another cat) > J(cat, dog)
– J(tweet on politics, another tweet on politics) > J(politics, technology)
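The definition can be checked directly on the banana/ananas example. The sketch below is a naive illustration, not the method used in the talk: it enumerates every factor of each string (quadratic in the length), whereas suffix-tree based methods are the usual choice for long texts. The function names `factors` and `joint_complexity` are illustrative.

```python
def factors(s):
    """Return the set of all factors (contiguous substrings) of s, including the empty one."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def joint_complexity(x, y):
    """Number of distinct factors shared by x and y (the empty factor counts as one)."""
    return len(factors(x) & factors(y))


print(joint_complexity("banana", "ananas"))  # prints 10
print(sorted(factors("banana") & factors("ananas"), key=len))
```

The count of 10 includes the empty factor, which matches the list on the slide.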
The joint complexity to measure text similarity
• Facebook Posts Can Offer Clues of Depression
• Royal Son-In-Law to Testify in Spanish Fraud Inquiry
• Iraqi Prisoner Tied to Hezbollah Faces U.S. Military Charges
• ‘Friends of Syria’ Gather in Tunis to Add Pressure on Assad
• In Australia, Ex-Premier Rudd to Challenge Gillard in Vote on Monday
• F.B.I. Bribery Case Falls Apart and Raises Questions
• China Urged to Continue Reforms for Growth
[Figure: NY Times tweets with joint complexity values between pairs of headlines; values shown: 18, 26, 37, 31, 27, 39, 32, 50, 40, 40]
Use of joint complexity for text and tweet classification, without reading them
International politics
Sports, entertainment
Technology, business
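One simple way to turn joint complexity into a classifier is sketched below. The decision rule (pick the class whose reference text shares the most factors with the tweet) is an assumption made for illustration, and the reference texts are invented; the slides do not spell out the actual procedure.

```python
def factors(s):
    """All contiguous substrings of s, including the empty one."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def joint_complexity(x, y):
    """Number of distinct factors common to x and y."""
    return len(factors(x) & factors(y))


def classify(tweet, references):
    """Return the label whose reference text has the largest joint complexity with the tweet.

    `references` maps a class label to one reference text (hypothetical setup).
    """
    return max(references, key=lambda label: joint_complexity(tweet, references[label]))


# Invented reference texts, for illustration only.
references = {
    "sport": "the striker scored a late goal to win the football match",
    "technology": "the new phone app lets users download software much faster",
}
print(classify("users can download the app today", references))
```

Longer reference texts tend to share more factors with anything, so a practical version would probably need references of comparable length or some normalization.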
Theoretical results
Markov sources
• Texts $X_n$ and $Y_m$ are generated by stationary Markov sources 1 and 2
– Of length n and m
• On alphabet A of finite size
– Markov source i with transition matrix $P_i$
– Stationary distribution is the right eigenvector of $P_i$ (eigenvalue 1)
$S_{n,m} = E\left(J(X_n, Y_m)\right)$
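To make the "right eigenvector" statement concrete, here is a minimal power-iteration sketch. It assumes the column-stochastic convention (entry `P[i][j]` is the probability of moving to symbol `i` from symbol `j`), which is what makes the stationary distribution a right eigenvector. The two 2×2 matrices are the ones that appear later in the talk as $P_1$ and $P_2$; the function name `stationary` is illustrative.

```python
def stationary(P, iterations=200):
    """Power iteration for the right eigenvector (eigenvalue 1) of a column-stochastic matrix P."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        # One step of the chain: pi <- P pi
        pi = [sum(P[i][j] * pi[j] for j in range(n)) for i in range(n)]
        total = sum(pi)
        pi = [x / total for x in pi]  # renormalize to guard against rounding drift
    return pi


P1 = [[0.0, 0.5],
      [1.0, 0.5]]
P2 = [[0.5, 0.5],
      [0.5, 0.5]]

print(stationary(P1))
print(stationary(P2))
```

Power iteration converges here because both chains are irreducible and aperiodic; a periodic chain would need a different solver.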
Markov sources
• We define
• We have
$S_{n,n} = \frac{\beta n^{\kappa}}{\sqrt{2\pi(\alpha \log n + \delta)}} \left(1 + Q(\log n) + O\left(\frac{1}{\log n}\right)\right)$
– With the main eigenvalue of
– Q(x)=0 in general, sometimes periodic or doubly periodic with small amplitude
P. Jacquet, “Common words between two random strings”, ISIT 2007
P. Jacquet, W. Szpankowski, “Joint complexity for Markov sources”, DMTCS AofA 2012
A toy example
• "yet while shannon's Theory has had profound social and economic impacts, its application beyond storage and point to point communication to the Internet, for example poses one of the most vexing challenges for scientists and engineers today. to keep pace with rapid advances in networking, biology, and quantum information processing, we need to rethink how we understand and integrate information."
– A={" ", "'", ",", ".", "I", "T", "a", "b", "c", "d", "e", "f", "g", "h", "i", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y"}
• Probability vector (memoryless case)
– P=[.148, .003, .013, .005, .002, .002, .065, .005, .030, .035, .087, .015, .020, .027, .067, .007, .017, .022, .097, .085, .030, .001, .037, .047, .072, .012, .005, .015, .005, .012]
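A probability vector of this kind can be estimated by counting symbols. The sketch below is one plausible way to obtain it, not necessarily how the slide's numbers were produced; `symbol_probabilities` is an illustrative name, and the short sample string stands in for the full paragraph.

```python
from collections import Counter


def symbol_probabilities(text):
    """Empirical memoryless model: relative frequency of each symbol, in sorted symbol order."""
    counts = Counter(text)
    total = len(text)
    return {symbol: counts[symbol] / total for symbol in sorted(counts)}


sample = "to keep pace with rapid advances in networking"
for symbol, prob in symbol_probabilities(sample).items():
    print(repr(symbol), round(prob, 3))
```

Sorting the symbols reproduces the ordering of the alphabet A above (space and punctuation first, then capitals, then lowercase letters).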
Memoryless vs uniform: joint complexity
Markov sources
• Converges on short texts
• When
Same Markov sources
From texts to equations
• We have
$S_{n,m} = \sum_{w \in A^*} P_1\left(|X_n|_w \ge 1\right) \times P_2\left(|Y_m|_w \ge 1\right)$
– where $|X|_w$ is the number of occurrences of $w$ in $X$
• A combinatorial identity between formal series
– For z complex
$\sum_{n} P\left(|X_n|_w \ge 1\right) z^n = \frac{P(w)\, z^{|w|}}{(1-z)\left((1-z)(1+a_w(z)) + P(w)\, z^{|w|}\right)}$
– Function $a_w(z)$ is the autocorrelation polynomial of string w.
P. Jacquet, W. Szpankowski, “Autocorrelation on words and its applications…”, J. of Combinatorial Theory, Series A, 1994
Fayolle, Ward, “Analysis of the average depth in a suffix tree under a Markov model”, DMTCS AofA 2005
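The generating-function identity can be sanity-checked numerically. The sketch below makes two assumptions that go beyond the slide: the source is memoryless, and $a_w(z)$ is the autocorrelation polynomial without its constant term (the coefficient of $z^k$ is the probability of the last $k$ symbols of $w$ whenever $w$ overlaps itself after a shift of $k$). It expands the rational function as a power series and compares each coefficient with a brute-force enumeration. All function names are illustrative.

```python
from itertools import product


def word_probability(w, p):
    """Probability of word w under a memoryless source with symbol probabilities p."""
    prob = 1.0
    for c in w:
        prob *= p[c]
    return prob


def autocorrelation(w, p):
    """Coefficients of a_w(z): a[k] = P(last k symbols of w) if w overlaps itself after a shift of k."""
    m = len(w)
    a = [0.0] * m
    for k in range(1, m):
        if w[:m - k] == w[k:]:
            a[k] = word_probability(w[m - k:], p)
    return a


def poly_mul(f, g):
    """Product of two polynomials given as coefficient lists (lowest degree first)."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[i + j] += fi * gj
    return out


def occurrence_probabilities(w, p, n_max):
    """Series coefficients of P(w) z^|w| / ((1-z)((1-z)(1+a_w(z)) + P(w) z^|w|)) up to z^n_max."""
    m = len(w)
    pw = word_probability(w, p)
    s = [1.0] + autocorrelation(w, p)[1:]  # 1 + a_w(z)
    d = poly_mul([1.0, -1.0], s)           # (1 - z)(1 + a_w(z))
    d[m] += pw                             # ... + P(w) z^|w|
    den = poly_mul([1.0, -1.0], d)         # outer factor (1 - z)
    num = [0.0] * m + [pw]                 # P(w) z^|w|
    coeffs = []
    for n in range(n_max + 1):
        c = num[n] if n < len(num) else 0.0
        for j in range(1, min(n, len(den) - 1) + 1):
            c -= den[j] * coeffs[n - j]
        coeffs.append(c / den[0])
    return coeffs


def brute_force(w, p, n):
    """P(w occurs at least once in a random text of length n), by full enumeration."""
    total = 0.0
    for letters in product(p, repeat=n):
        x = "".join(letters)
        if w in x:
            total += word_probability(x, p)
    return total


p = {"a": 0.3, "b": 0.7}
series = occurrence_probabilities("aba", p, 8)
for n in range(9):
    print(n, round(series[n], 6), round(brute_force("aba", p, n), 6))
```

The enumeration is exponential in n, so it only serves as a check on short texts and small alphabets.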
From text to equations
• In most cases the autocorrelation polynomial $a_w(z)$ is small compared with 1
• We know that there exist $\rho < 1$ and sets $B_k \subseteq A^k$
– such that
$\forall k: \sum_{w \in A^k - B_k} P(w) = O(\rho^k)$
– And
$\forall k, \forall w \in B_k: P\left(|X_n|_w \ge 1\right) = \left(1 - (1 - P(w))^n\right)\left(1 + O(\rho^k)\right)$
• (formally we forget the autocorrelation, taking $a_w(z)$ to be 0, and set $z^{|w|} = z$ for $z \approx 1$)
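The formal simplification can be followed in two lines. Assuming the generating-function identity has the form $\sum_n P(|X_n|_w \ge 1) z^n = P(w) z^{|w|} / \big((1-z)((1-z)(1+a_w(z)) + P(w) z^{|w|})\big)$, dropping the autocorrelation and replacing $z^{|w|}$ by $z$ gives a rational function whose partial fractions expose the coefficient $1-(1-P(w))^n$:

```latex
\begin{align*}
\sum_{n \ge 0} P\left(|X_n|_w \ge 1\right) z^n
  &\approx \frac{P(w)\, z}{(1-z)\left(1 - (1 - P(w))\, z\right)}
   = \frac{1}{1-z} - \frac{1}{1 - (1 - P(w))\, z} \\
  &= \sum_{n \ge 0} \left(1 - (1 - P(w))^n\right) z^n .
\end{align*}
```

This is the same expression that appears, up to the factor $1 + O(\rho^k)$, in the estimate for words $w \in B_k$.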
From text to equations
• Consequence: there exists ε>0
– With
– The larger the alphabet, the larger ε
Markov generating function
– Introducing double Poisson g.f.
• And a linear vector equation
Double Mellin transform
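• Classic definition is
– Satisfies the identity
$C^*(s_1,s_2) = s_1 \Gamma(s_1)\, s_2 \Gamma(s_2)\, \left(I - P(s_1,s_2)\right)^{-1} \mathbf{1}$
with $I$ the identity matrix, $\mathbf{1}$ the vector full of 1's, and $C^*(s_1,s_2) = \left[C_a(s_1,s_2)\right]_{a \in A}^{T}$
– Defined for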
Double Inverse Mellin transform
• We have
$C_{n,n} = \frac{1}{(2i\pi)^2} \iint_{\Re(s_1,s_2) = (\rho_1,\rho_2)} C^*(s_1,s_2)\, n^{-s_1-s_2}\, ds_1\, ds_2 \times \left(1 + O(n^{-1})\right)$
Singularity analysis
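• The Kernel $K$ is the set of $(s_1,s_2)$ such that $\lambda(s_1,s_2) = 1$
• Let $(c_1,c_2) \in K$ be the point of $K$ that minimizes $-s_1 - s_2$, and integrate along $\Re(s_1,s_2) = (c_1,c_2)$:
$C_{n,n} = \frac{1}{(2i\pi)^2} \iint_{\Re(s_1,s_2) = (c_1,c_2)} C^*(s_1,s_2)\, n^{-s_1-s_2}\, ds_1\, ds_2 \times \left(1 + O(n^{-1})\right)$
– Via bi-dimensional saddle point method
$C_{n,n} = \frac{f(c_1,c_2)\, n^{-c_1-c_2}}{\sqrt{2\pi(\alpha \log n + \delta)}} \left(1 + O\left(\frac{1}{\log n}\right)\right)$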
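The kernel condition $\lambda(s_1,s_2)=1$ can be explored numerically. The sketch below rests on an assumption that is not stated in this transcript: that $P(s_1,s_2)$ has entries $P_1(a|b)^{-s_1} P_2(a|b)^{-s_2}$, with the entry set to 0 whenever either source forbids the transition. Under that assumption the main eigenvalue is obtained by power iteration. The matrices are the $P_1$, $P_2$ example from the talk; `combined_matrix` and `main_eigenvalue` are illustrative names.

```python
def combined_matrix(P1, P2, s1, s2):
    """Assumed form of P(s1, s2): entrywise P1^(-s1) * P2^(-s2), zero where either entry is zero."""
    n = len(P1)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if P1[i][j] > 0 and P2[i][j] > 0:
                M[i][j] = P1[i][j] ** (-s1) * P2[i][j] ** (-s2)
    return M


def main_eigenvalue(M, iterations=500):
    """Largest eigenvalue of a non-negative matrix by power iteration (0.0 if the iterate vanishes)."""
    n = len(M)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iterations):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        total = sum(w)
        if total == 0.0:
            return 0.0  # nilpotent case: every eigenvalue is 0
        lam = total / sum(v)
        v = [x / total for x in w]
    return lam


P1 = [[0.0, 0.5],
      [1.0, 0.5]]
P2 = [[0.5, 0.5],
      [0.5, 0.5]]

# Identical sources: the matrix is stochastic whenever s1 + s2 = -1, so lambda = 1 there.
print(main_eigenvalue(combined_matrix(P2, P2, -0.5, -0.5)))
# Different sources: scan the diagonal s1 = s2 = s to see where lambda crosses 1.
for s in (-0.6, -0.5, -0.4, -0.3):
    print(s, main_eigenvalue(combined_matrix(P1, P2, s, s)))
```

Locating the actual point $(c_1,c_2)$ would additionally require minimizing $-s_1-s_2$ over the curve $\lambda=1$, which this sketch does not attempt.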
Singularity analysis (cont)
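$C_{n,n} = \frac{f(c_1,c_2)\, n^{\kappa}}{\sqrt{2\pi(\alpha \log n + \delta)}} \left(1 + O\left(\frac{1}{\log n}\right)\right)$
with
– $\kappa = -c_1 - c_2$
– $f(c_1,c_2)$ a function of the left and right main eigenvectors of $P(c_1,c_2)$
– $\alpha = \alpha(c_1,c_2)$ and $\delta = \delta(c_1,c_2)$ functions of the derivatives of $\lambda(s_1,s_2)$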
Special cases 1: border effect
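• When $c_1 > 0$ (or $c_2 > 0$)
– Apply residue theorem on $\Gamma(s_1)$
$C_{n,n} = f(0,c_2^*)\, \frac{n^{-c_2^*}}{\frac{\partial}{\partial s_1} \lambda(0,c_2^*)} \left(1 + O\left(\frac{1}{n}\right)\right)$, with $(0,c_2^*) \in K$
• Example:
$P_1 = \begin{bmatrix} 0 & 0.5 \\ 1 & 0.5 \end{bmatrix}, \quad P_2 = \begin{bmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{bmatrix}$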
Special case 2: periodic terms
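• We define the Kernel border
$\partial K = \{(s_1,s_2) \in K \wedge \Re(s_1) = c_1, \Re(s_2) = c_2\}$
• Then
$C_{n,n} = n^{\kappa} \sum_{(s_1,s_2) \in \partial K} \frac{f(s_1,s_2)\, \Gamma(s_1)\Gamma(s_2)}{\sqrt{2\pi(\alpha \log n + \delta(s_1,s_2))}}\, n^{-i(\Im(s_1) + \Im(s_2))} \left(1 + O\left(\frac{1}{\log n}\right)\right)$
• In general $\partial K = \{(c_1,c_2)\}$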
Periodic terms
• When $\partial K \neq \{(c_1,c_2)\}$ it is a lattice
Periodic terms
• Plus optimal saddle point expansion for n<1000
Case 3: nilpotent matrix
• If matrix $P(s_1,s_2)$ is nilpotent then $K = \emptyset$
– Number of common factors is bounded
$\lim_{n \to \infty} C_{n,n} = \mathbf{1}^{T} \left(I - P(0,0)\right)^{-1} \mathbf{1}$
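For a nilpotent matrix the inverse is a finite sum, $(I - N)^{-1} = I + N + N^2 + \dots$, so the limit can be evaluated by adding up the entries of the successive powers. The sketch below assumes that $P(0,0)$ is the 0/1 matrix marking the transitions allowed by both sources, which is an interpretation rather than something stated on the slide; the 3×3 matrix is an invented example and `bounded_limit` is an illustrative name.

```python
def bounded_limit(N):
    """Sum of all entries of I + N + N^2 + ... + N^(n-1), i.e. 1^T (I - N)^(-1) 1 for nilpotent N."""
    n = len(N)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # N^0 = I
    total = 0.0
    for _ in range(n):  # an n x n nilpotent matrix satisfies N^n = 0
        total += sum(sum(row) for row in power)
        power = [[sum(power[i][k] * N[k][j] for k in range(n)) for j in range(n)]
                 for i in range(n)]
    return total


# Invented example: transitions allowed by both sources form a graph without cycles.
N = [[0.0, 1.0, 1.0],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 0.0]]
print(bounded_limit(N))  # prints 7.0
```

In this example the value 7 counts the paths in the acyclic transition graph: three single symbols, three allowed pairs, and one allowed triple.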
[Figure panels: Eng/Fr, Eng/Pl]
Enough math
Let’s play with Twitter
Automated Tweet classification
politics sport lifestyle economics technology
http://t.co/Ni8bIPZEYD I wanna do this next year this is a goal
"allow users to download an entire movie in one second." I need this http://t.co/3fbNfKEkah
Green energy boss accuses Govt of obstructing renewable energy development http://t.co/v5Lq2Jx1GQ
“@Orbinho: No Cesc, No. http://t.co/gSbid9FCl0” More chance of Adebayor becoming world class tbh
@NadineDorriesMP needs to learn to add up. #LibDems 57 MPs vs UKIP 0 MPs http://t.co/2LrP2YL98
‘Awesome’ Fenyas Elegance takes runner-up spot at Chatsworth | Irish Examiner: http://t.co/3iW88RyASB via @irishexaminer
http://t.co/T2AJhaoL2p about a leader who could offer no change..
RT @Orbinho: No Cesc, No. http://t.co/QWXlKemghP
Beale doubtful for Lions series Australia back Kurtley Beale could miss the series against #Rugbyclubbusiness,#RFU http://t.co/JusK4lhMjU
RT @helenlewis: Michael Gove is relying on stats taken from PR surveys by Premier Inn & UK Gold. This govt has a problem w/evidence. http:/…
RT @GNev2: Good piece by Martin Samuel this morning in the Mail on Pellegrini potential appointment at city. http://t.co/3a7ytXGGAd
Mumbai residential prices up 66% in past four years. Poor victims of con job, made only 66%. Poor victims: http://t.co/qyr1BgucS5
RT @itvfootball: Pass the sick bucket - Serbian Ultras help out with a mid-match marriage proposal http://t.co/bZTAO3DslW
World's Longest Cat Dies ... http://t.co/gsvcPof5ju
Scary stuff! RT @the__socialite: RT @Asha_EK: TRA denies penalties for using Skype in UAE http://t.co/AS9mpBUHdR
Not allowed. Nope. RT @Orbinho: No Cesc, No. http://t.co/CRmf4g1bAH
Spotify rushes to fix download flaw that allows music 2 b downloaded http://t.co/tzSUmyjFm6
RT @kenjbarnes1: “@ktumulty: Wow: IRS targeted groups that criticized the government, IG report says http://t.co/4Znpb16aqJ”
Pittsburgh Steelers Undrafted Rookie Free Agent Profile: Northern Illinois Outsi[..] - http://t.co/opb8XVAsLE
RT @Will_Antonin: Once again, instead of just reporting the scandal itself, the press frames it as a *Republican* claim: http://t.co/GlaAnV…
… 首相は前と全然違う、成長された 谷垣氏 : 政治 http://t.co/1C3iQqvS2i 新聞記事というのはどうして合理的な理解が出来るような伝え方をしないのだろうか?どこを指して成長したと言っているのか、具体的に書かなければその判断が正しいかどうか評価しようがない。
http://t.co/HYVGQRneLs سيسك و فان بيرسي مع بعض في اليونايتد ؟! يمكن انتحر
RT @MirrorFootball: Holloway: "I expect the name of Wilfried Zaha to become much more than an answer to an obscure quiz question"#cpfc htt…
Print een Stormtrooper van jezelf in 3D #newslocker http://t.co/ibesVUIerA
えーと?何処に置くですかねー ( ̄^ ̄ ) ゞ“@Chooemon92: つまり日本は核廃棄物場になる RT @Bu_uuu: 安倍首相、
… 今度は東欧で原発セールス 4 国首脳と会談へ http://t.co/5GC2UCTaqz 原発で出た核廃棄物はすべて日本に返される契約になってる
RT @laurenlaverne: Pwned: "Mr Men" teacher hits back at Michael Gove http://t.co/cqGMzUsGNt
RT @reema80: Club with morals? Sheffield Wednesday turn down sponsorship deal with payday lender http://t.co/8hmndQdvzY (via @infoman71)
Phil Jones V Wayne Rooney? http://t.co/0V2YfxPRJi
자신이 불통인사를 한 장본인인데 진노했고 불호령을 했다니 사과한 게 맞나 ?@ 박 대통령 , ‘ ’ ‘ ’ 국민 앞 아닌 회의실 사과 http://t.co/KvDKS32DnD
19 травня «Дніпро» зустрінеться з «Металістом» http://t.co/1twA7WTdHj
Вот уже и таджики распродают потихоньку территорию Российской Империи http://t.co/zbjPe1ILZm http://t.co/dKEyqnBZB1
Почему в Грозном нельзя играть в футбол: http://t.co/OL4flpoBCb
Pelegrini: Ne preuzimam Siti http://t.co/7tVuvnu0ow via @Mondoportal Kvota 1.05 da malo lagi... #StanJames
RT @art19maroc: PSG : le trophée est dans la poche http://t.co/lSRDbL5XRi
RT @yukio20686515: 実は反省なし? の民主党「大反省会」 - MSN産経ニュース http://t.co/85adbCMMEHちょと見たけど、反省してる姿よりも、人間関係がはっきり見えた。敵視、口を濁す。何のために、でてきたんだろ。。。。。。。。。。。
Facebook Starts Home Improvements After App’s Lukewarm Reception http://t.co/ifTHiI4k0U
「広域処理 ずさん交付金 がれき以外に9割支出」2013-05-13 (TOKYO Web) http://t.co/u9FxYnyT3A
http://t.co/lQ6PuEBmBO... http://t.co/UtsTaVZ3at
Leonardo vreest na beenbreuk voor einde carrière #NUnl http://t.co/Dp7LbLvfk5
RT @daddy_san: Wonderful profile of F1's reluctant rockstar Kimi Raikkonen, why he hates interviews and a possible move to Red Bull. http:/…
Tiger Woods hangs on to win The Players Championship http://t.co/RdHTAG7pXC
RT @amrhamdon: أهالى البالك بلوك يتوجهون لطرة إلنهاء إجراءات اإلفراج عنهم | أخبار الموجز http://t.co/JxU88vYjSF… via @Almogaz #مصر #بالك_بل…
Läser http://t.co/fPapW7Fjjm Haha, ja än är inte undrens tid förbi!
#Germany, #Israel and #Belgium vie for UNSC Western European and Others seat for 2019: details http://t.co/buIeQ9dEel
RT @MaritzVB: #SocialMedia negatively affecting our lives? Psychologists say heart brakes are worse in the digital age http://t.co/ogSVUnrM…
Michael Gove accuses young people of being lazily reductive, uses lazily reductive sources to back it up: http://t.co/gkPksTORfP
RT @EducationLabour: Michael Gove announces his new centre for evidence based education policy http://t.co/rbsrduZ6b9
RT @hanayuu: @tokaiama 【アメとムチ】 自民党が原発再稼働と地域雇用創出と北陸新幹線敦賀延伸を「抱き合わせ」で提示
http://t.co/gFW4Igwykl
Conclusion and perspectives
• Extendable to Markov sources of any order
• Convergence of Markov models to natural language
• Factors common to k texts (k>2)
• Open problem: variance of joint complexity
– Related to the variance of suffix tree size problem
Thank you
• QUESTIONS…