Classification of Markov Sources Through Joint String Complexity: Theory and Experiments Philippe Jacquet, Dimitrios Milioris, Bell Labs, Wojciech Szpankowski, Purdue U

Post on 11-Jan-2016


Classification of Markov Sources Through Joint String Complexity:

Theory and Experiments

Philippe Jacquet, Dimitrios Milioris,

Bell Labs,

Wojciech Szpankowski,

Purdue U


How to compare

• Two DNA sequences?

• Two Tweets?

– Use the joint complexity!

"allow users to download an entire movie in one second." I need this http://t.co/3fbNfKEkah

Green energy boss accuses Govt of obstructing renewable energy development http://t.co/v5Lq2Jx1GQ


Joint complexity

• Definition:

– Two sequences X, Y

– J(X,Y) is the number of common factors between X and Y

• A factor is a run of consecutive symbols (a substring)

• X=banana, Y=ananas: J(X,Y)=10, counting the empty factor: ε, a, n, an, na, ana, nan, anan, nana, anana.

• The larger J(X,Y), the closer X and Y are

– J(DNA of a cat, DNA of another cat) > J(cat, dog)

– J(tweet on politics, another tweet on politics) > J(politics, technology)
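
The definition can be checked on the banana/ananas example with a brute-force sketch. It enumerates every factor of each string, which is quadratic in the text length, so it only suits short texts such as tweets; the function names are illustrative and not taken from the slides.

```python
def factors(text: str) -> set[str]:
    """Return every factor (contiguous substring) of text, empty factor included."""
    n = len(text)
    return {text[i:j] for i in range(n + 1) for j in range(i, n + 1)}


def joint_complexity(x: str, y: str) -> int:
    """Count the distinct factors shared by x and y (the empty factor counts)."""
    return len(factors(x) & factors(y))


print(joint_complexity("banana", "ananas"))  # 10
```

For long sequences such as DNA, a suffix tree or suffix automaton would be the usual replacement for the explicit sets.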


The joint complexity to measure text similarity

Facebook Posts Can Offer Clues of Depression

Royal Son-In-Law to Testify in Spanish Fraud Inquiry

Iraqi Prisoner Tied to Hezbollah Faces U.S. Military Charges

‘Friends of Syria’ Gather in Tunis to Add Pressure on Assad

In Australia, Ex-Premier Rudd to Challenge Gillard in Vote on Monday

F.B.I. Bribery Case Falls Apart and Raises Questions

China Urged to Continue Reforms for Growth

[Figure: the NY Times tweets above, annotated with joint complexity values 18, 26, 37, 31, 27, 39, 32, 50, 40, 40]


Use of joint complexity for text and tweet classification, without reading them

International politics

Sports, entertainment

Technology business


Theoretical results


Markov sources

• Texts $X_n$ and $Y_m$ are generated by stationary Markov sources 1 and 2

– Of length n and m

• On an alphabet A of finite size

– Markov source i with transition matrix $P_i$

– Stationary distribution is the right eigenvector of $P_i$ (eigenvalue 1)

$S_{n,m} = E\bigl(J(X_n, Y_m)\bigr)$
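
The expectation $S_{n,m}$ can be approximated by simulation. The sketch below is a Monte Carlo estimate under stated assumptions: matrices are column-stochastic (entry `P[a][b]` is the probability of emitting `a` after `b`, which is why the stationary law is a right eigenvector), and each text starts from a uniformly drawn state rather than the stationary law, a simplification. The two example matrices are the ones shown later on the border-effect slide; all function names are hypothetical.

```python
import random


def sample_markov(P, n, rng):
    """Draw n symbols from a chain where P[a][b] = Prob(next = a | current = b)."""
    k = len(P)
    state = rng.randrange(k)  # simplification: uniform start, not the stationary law
    out = []
    for _ in range(n):
        weights = [P[a][state] for a in range(k)]  # column of the current state
        state = rng.choices(range(k), weights=weights)[0]
        out.append(state)
    return tuple(out)


def joint_complexity(x, y):
    """Number of distinct common factors of two sequences (empty factor included)."""
    fx = {x[i:j] for i in range(len(x) + 1) for j in range(i, len(x) + 1)}
    fy = {y[i:j] for i in range(len(y) + 1) for j in range(i, len(y) + 1)}
    return len(fx & fy)


def estimate_S(P1, P2, n, m, trials=200, seed=0):
    """Monte Carlo estimate of E[J(X_n, Y_m)] over independent pairs of texts."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += joint_complexity(sample_markov(P1, n, rng), sample_markov(P2, m, rng))
    return total / trials


P1 = [[0.0, 0.5], [1.0, 0.5]]
P2 = [[0.5, 0.5], [0.5, 0.5]]
print(estimate_S(P1, P2, n=50, m=50))
```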


Markov sources

• We define

• We have

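$S_{n,n} = \dfrac{\beta\, n^{\kappa}}{\sqrt{2\pi(\alpha \log n + \delta)}} \left(1 + Q(\log n) + O\!\left(\dfrac{1}{\log n}\right)\right)$

– With $\lambda(s_1,s_2)$ the main eigenvalue of $P(s_1,s_2)$

– $Q(x)=0$ in general, sometimes periodic or doubly periodic with small amplitude

P. Jacquet, “Common words between two random strings”, ISIT 2007

P. Jacquet, W. Szpankowski, “Joint complexity for Markov sources”, DMTCS AofA 2012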


A toy example

• "yet while shannon's Theory has had profound social and economic impacts, its application beyond storage and point to point communication to the Internet, for example poses one of the most vexing challenges for scientists and engineers today. to keep pace with rapid advances in networking, biology, and quantum information processing, we need to rethink how we understand and integrate information."

– A={" ", "'", ",", ".", "I", "T", "a", "b", "c", "d", "e", "f", "g", "h", "i", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y"}

• Probability vector (memoryless case)

– P=[.148, .003, .013, .005, .002, .002, .065, .005, .030, .035, .087, .015, .020, .027, .067, .007, .017, .022, .097, .085, .030, .001, .037, .047, .072, .012, .005, .015, .005, .012]
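
A memoryless model of this kind can be fitted by counting symbols. The sketch below assumes the probability vector is simply the empirical symbol frequency of the text, which the slide does not state explicitly; symbols are sorted by code point, the same order in which the slide lists the alphabet A.

```python
from collections import Counter


def symbol_frequencies(text: str) -> dict[str, float]:
    """Empirical probability of each symbol, keyed in code-point order."""
    counts = Counter(text)
    total = len(text)
    return {sym: counts[sym] / total for sym in sorted(counts)}


toy = (
    "yet while shannon's Theory has had profound social and economic impacts, "
    "its application beyond storage and point to point communication to the "
    "Internet, for example poses one of the most vexing challenges for "
    "scientists and engineers today. to keep pace with rapid advances in "
    "networking, biology, and quantum information processing, we need to "
    "rethink how we understand and integrate information."
)

freqs = symbol_frequencies(toy)
print(list(freqs))                             # the alphabet A
print([round(p, 3) for p in freqs.values()])   # the memoryless probability vector
```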


Memoryless vs uniform: joint complexity


Markov sources

• Converges on short texts

• When


Same Markov sources


From texts to equations

• We have

$S_{n,n} = \sum_{w\in A^*} P_1(|X_n|_w \ge 1) \times P_2(|Y_n|_w \ge 1)$

– $|X_n|_w$ is the number of occurrences of w in $X_n$

• A combinatorial identity between formal series

– For z complex

$\sum_{n \ge 0} P(|X_n|_w \ge 1)\, z^n = \dfrac{P(w)\, z^{|w|}}{(1-z)\left((1-z)(1+a_w(z)) + P(w)\, z^{|w|}\right)}$

– The function $a_w(z)$ is the autocorrelation polynomial of string w.

P. Jacquet, W. Szpankowski, “Autocorrelation on words and its applications…”, J of combinatorial Analysis, 1994

Fayolle, Ward, “Analysis of the average depth in a suffix tree under a Markov model”, DMTCS AofA 2005
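
The generating-function identity can be checked numerically in the memoryless case. The sketch below expands $P(w) z^{|w|} / \bigl((1-z)((1-z)(1+a_w(z)) + P(w) z^{|w|})\bigr)$ as a power series and compares each coefficient with the probability that w occurs in a random text of length n, obtained by enumerating all texts. It assumes $a_w(z)$ has one term $P(w_{k+1} \dots w_{|w|})\, z^{|w|-k}$ per proper border of length k of w; the helper names and the two-letter source are illustrative.

```python
from itertools import product


def word_prob(w, p):
    """Probability of the word w under a memoryless source with symbol law p."""
    prob = 1.0
    for ch in w:
        prob *= p[ch]
    return prob


def autocorrelation(w, p):
    """Coefficients of a_w(z), degrees 0..|w|-1, one term per proper border."""
    m = len(w)
    coeffs = [0.0] * m
    for k in range(1, m):
        if w[:k] == w[m - k:]:  # prefix of length k equals suffix of length k
            coeffs[m - k] += word_prob(w[k:], p)
    return coeffs


def poly_mul(a, b):
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def occurrence_series(w, p, n_max):
    """Series coefficients of the generating function of P(w occurs in X_n)."""
    m, pw = len(w), word_prob(w, p)
    one_plus_a = autocorrelation(w, p)
    one_plus_a[0] += 1.0
    d = poly_mul([1.0, -1.0], one_plus_a)  # (1 - z)(1 + a_w(z))
    d[m] += pw                             # + P(w) z^|w|
    den = poly_mul([1.0, -1.0], d)         # times the outer (1 - z)
    num = [0.0] * m + [pw]
    coeffs = []
    for n in range(n_max + 1):             # power-series division, den[0] == 1
        c = num[n] if n < len(num) else 0.0
        for j in range(1, min(n, len(den) - 1) + 1):
            c -= den[j] * coeffs[n - j]
        coeffs.append(c)
    return coeffs


def brute_force(w, p, n):
    """P(w occurs in X_n) by summing over all texts of length n."""
    return sum(
        word_prob(s, p)
        for s in ("".join(t) for t in product(p, repeat=n))
        if w in s
    )


p = {"a": 0.3, "b": 0.7}
series = occurrence_series("aba", p, 10)
print(max(abs(series[n] - brute_force("aba", p, n)) for n in range(11)))
```

The printed gap should be at floating-point noise level if the identity holds as stated.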


From text to equations

• In most cases $a_w(z) \ll 1$

• We know that there is $\rho < 1$ and sets $B_k \subseteq A^k$

– such that

$\forall k:\ \sum_{w \in A^k - B_k} P(w) = O(\rho^k)$

– And

$\forall k, \forall w \in B_k:\ P(|X_n|_w \ge 1) = \left(1 - (1 - P(w))^n\right)\left(1 + O(\rho^k)\right)$

• (formally we forget the autocorrelation and set $a_w(z) = 0$ and $z^{|w|} = z$ for $z \approx 1$)

P. Jacquet, W. Szpankowski, “Autocorrelation on words and its applications…”, J of combinatorial Analysis, 1994

Fayolle, Ward, “Analysis of the average depth in a suffix tree under a Markov model”, DMTCS AofA 2005
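
The formal simplification can be worked out in one line. Starting from the generating function $P(w) z^{|w|} / \bigl((1-z)((1-z)(1+a_w(z)) + P(w) z^{|w|})\bigr)$ of $P(|X_n|_w \ge 1)$, dropping the autocorrelation and replacing $z^{|w|}$ by $z$ gives a rational function whose partial fractions yield the $1-(1-P(w))^n$ term directly:

```latex
\sum_{n\ge 0} P(|X_n|_w \ge 1)\, z^n
  \;\approx\; \frac{P(w)\, z}{(1-z)\bigl(1-(1-P(w))\,z\bigr)}
  \;=\; \frac{1}{1-z} - \frac{1}{1-(1-P(w))\,z}
  \;=\; \sum_{n\ge 0} \Bigl(1-(1-P(w))^{n}\Bigr) z^{n}.
```

This is the probability that at least one of n independent trials, each succeeding with probability $P(w)$, succeeds.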


From text to equations

• Consequence: there exists ε>0

– With

– The larger the alphabet, the larger ε


Markov generating function

– Introducing double Poisson g.f.

• And a linear vector equation


Double Mellin transform


Double Mellin transform

• Classic definition is

– Satisfies the identity

$C^*(s_1,s_2) = s_1\Gamma(s_1)\, s_2\Gamma(s_2)\, (I - P(s_1,s_2))^{-1}\, \mathbf{1}$

where I is the identity matrix, $\mathbf{1}$ is the vector full of 1's, and $C^*(s_1,s_2) = [C_a(s_1,s_2)]^T_{a\in A}$

– Defined for


Double Inverse Mellin transform

• We have

$C_{n,n} = \dfrac{1}{(2i\pi)^2} \iint_{\Re(s_1,s_2)=(\rho_1,\rho_2)} C^*(s_1,s_2)\, n^{-s_1-s_2}\, ds_1\, ds_2 \times \left(1+O(n^{-1})\right)$

-1 0


Singularity analysis

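• The Kernel K is the set of $(s_1,s_2)$ such that $\lambda(s_1,s_2) = 1$

• Let $(c_1,c_2)\in K$ be the point that minimizes $-s_1-s_2$

$C_{n,n} = \dfrac{1}{(2i\pi)^2} \iint_{\Re(s_1,s_2)=(c_1,c_2)} C^*(s_1,s_2)\, n^{-s_1-s_2}\, ds_1\, ds_2 \times \left(1+O(n^{-1})\right)$

– Via bi-dimensional Saddle point method

$C_{n,n} = \dfrac{f(c_1,c_2)\, n^{-c_1-c_2}}{\sqrt{2\pi(\alpha \log n + \delta)}} \left(1+O\!\left(\dfrac{1}{\log n}\right)\right)$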


Singularity analysis (cont)

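$C_{n,n} = \dfrac{f(c_1,c_2)\, n^{\kappa}}{\sqrt{2\pi(\alpha \log n + \delta)}} \left(1+O\!\left(\dfrac{1}{\log n}\right)\right)$

with

– $\kappa = -c_1 - c_2$

– $f(c_1,c_2)$ a function of the left and right main eigenvectors of $P(c_1,c_2)$

– $\alpha = \alpha(c_1,c_2)$ and $\delta = \delta(c_1,c_2)$ functions of the derivatives of $\lambda(s_1,s_2)$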


Special cases 1: border effect

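• When $c_1 > 0$ (or $c_2 > 0$)

– Apply residue theorem on $\Gamma(s_1)$

$C_{n,n} = f(0,c_2^*)\, \dfrac{n^{-c_2^*}}{\partial_{s_1}\lambda(0,c_2^*)} \left(1+O\!\left(\dfrac{1}{n}\right)\right)$, with $(0,c_2^*)\in K$

Example:

$P_1 = \begin{bmatrix} 0 & 0.5 \\ 1 & 0.5 \end{bmatrix}$, $P_2 = \begin{bmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{bmatrix}$

[Figure: axes $\rho_1$, $\rho_2$]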


Special case 2: periodic terms

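• We define the Kernel border

$\partial K = \{(s_1,s_2)\in K \wedge \Re(s_1) = c_1, \Re(s_2) = c_2\}$

• Then

$C_{n,n} = n^{\kappa} \sum_{(s_1,s_2)\in\partial K} \dfrac{f(s_1,s_2)\,\Gamma(s_1)\Gamma(s_2)}{\sqrt{2\pi(\alpha x + \delta(s_1,s_2))}}\, n^{-i(\Im(s_1)+\Im(s_2))} \left(1+O\!\left(\dfrac{1}{\log n}\right)\right)$, with $x = \log n$

• In general $\partial K = \{(c_1,c_2)\}$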


Periodic terms

• When $\partial K \neq \{(c_1,c_2)\}$ it is a lattice


Periodic terms

• Plus optimal saddle point expansion for n<1000


Case 3: nilpotent matrix

• If matrix $P(s_1,s_2)$ is nilpotent then $K = \emptyset$

– Number of common factors is bounded

$\lim_{n\to\infty} C_{n,n} = \mathbf{1}\,(I - P(0,0))^{-1}\,\mathbf{1}$

[Figure panels: Eng/Fr, Eng/Pl]


Enough math

Let’s play with Twitter


Automated Tweet classification

politics, sport, lifestyle, economics, technology
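
One simple way to turn joint complexity into a classifier is to score a tweet against a few reference texts per class and pick the best-scoring class. The sketch below does that with an average score. It is an illustration only, not the procedure used in the experiments, and the reference texts and labels are made up.

```python
def joint_complexity(x: str, y: str) -> int:
    """Number of distinct common factors of x and y (empty factor included)."""
    fx = {x[i:j] for i in range(len(x) + 1) for j in range(i, len(x) + 1)}
    fy = {y[i:j] for i in range(len(y) + 1) for j in range(i, len(y) + 1)}
    return len(fx & fy)


def classify(tweet: str, references: dict[str, list[str]]) -> str:
    """Return the label whose reference texts share the most factors with tweet."""
    def score(label: str) -> float:
        texts = references[label]
        return sum(joint_complexity(tweet, t) for t in texts) / len(texts)
    return max(references, key=score)


# Hypothetical reference texts, one per class, for illustration only.
references = {
    "sport": ["the match ended with a late goal"],
    "technology": ["the new phone has a faster chip"],
}
print(classify("a late goal won the match", references))  # sport
```

No tokenization or dictionary is involved, so the same code applies unchanged to tweets in any language or script.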


http://t.co/Ni8bIPZEYD  I wanna do this next year this is a goal

"allow users to download an entire movie in one second." I need this http://t.co/3fbNfKEkah

Green energy boss accuses Govt of obstructing renewable energy development http://t.co/v5Lq2Jx1GQ

“@Orbinho: No Cesc, No. http://t.co/gSbid9FCl0”  More chance of Adebayor becoming world class tbh

@NadineDorriesMP needs to learn to add up. #LibDems 57 MPs vs UKIP 0 MPs http://t.co/2LrP2YL98

‘Awesome’ Fenyas Elegance takes runner-up spot at Chatsworth | Irish Examiner: http://t.co/3iW88RyASB via @irishexaminer

http://t.co/T2AJhaoL2p about a leader who could offer no change..

RT @Orbinho: No Cesc, No. http://t.co/QWXlKemghP

Beale doubtful for Lions series Australia back Kurtley Beale could miss the series against #Rugbyclubbusiness,#RFU http://t.co/JusK4lhMjU

RT @helenlewis: Michael Gove is relying on stats taken from PR surveys by Premier Inn & UK Gold. This govt has a problem w/evidence. http:/…


RT @GNev2: Good piece by Martin Samuel this morning in the Mail on Pellegrini potential appointment at city.   http://t.co/3a7ytXGGAd

Mumbai residential prices up 66% in past four years. Poor victims of con job, made only 66%. Poor victims: http://t.co/qyr1BgucS5

RT @itvfootball: Pass the sick bucket - Serbian Ultras help out with a mid-match marriage proposal http://t.co/bZTAO3DslW

World's Longest Cat Dies ...  http://t.co/gsvcPof5ju

Scary stuff! RT @the__socialite: RT @Asha_EK: TRA denies penalties for using Skype in UAE http://t.co/AS9mpBUHdR

Not allowed. Nope. RT @Orbinho: No Cesc, No. http://t.co/CRmf4g1bAH

Spotify rushes to fix download flaw that allows music 2 b downloaded http://t.co/tzSUmyjFm6

RT @kenjbarnes1: “@ktumulty: Wow: IRS targeted groups that criticized the government, IG report says http://t.co/4Znpb16aqJ”

Pittsburgh Steelers Undrafted Rookie Free Agent Profile: Northern Illinois Outsi[..] - http://t.co/opb8XVAsLE


RT @Will_Antonin: Once again, instead of just reporting the scandal itself, the press frames it as a *Republican* claim: http://t.co/GlaAnV…

… 首相は前と全然違う、成長された 谷垣氏 : 政治 http://t.co/1C3iQqvS2i 新聞記事というのはどうして合理的な理解が出来るような伝え方をしないのだろうか?どこを指して成長したと言っているのか、具体的に書かなければその判断が正しいかどうか評価しようがない。

http://t.co/HYVGQRneLs سيسك و فان بيرسي مع بعض في اليونايتد ؟! يمكن انتحر

RT @MirrorFootball: Holloway: "I expect the name of Wilfried Zaha to become much more than an answer to an obscure quiz question"#cpfc htt…

Print een Stormtrooper van jezelf in 3D #newslocker http://t.co/ibesVUIerA

えーと?何処に置くですかねー ( ̄^ ̄ ) ゞ“@Chooemon92: つまり日本は核廃棄物場になる RT @Bu_uuu: 安倍首相、

… 今度は東欧で原発セールス 4 国首脳と会談へ http://t.co/5GC2UCTaqz 原発で出た核廃棄物はすべて日本に返される契約になってる

RT @laurenlaverne: Pwned: "Mr Men" teacher hits back at Michael Gove http://t.co/cqGMzUsGNt

RT @reema80: Club with morals? Sheffield Wednesday turn down sponsorship deal with payday lender http://t.co/8hmndQdvzY (via @infoman71)

Phil Jones V Wayne Rooney?  http://t.co/0V2YfxPRJi


자신이 불통인사를 한 장본인인데 진노했고 불호령을 했다니 사과한 게 맞나 ?@ 박 대통령 , ‘ ’ ‘ ’ 국민 앞 아닌 회의실 사과 http://t.co/KvDKS32DnD

19 травня «Дніпро» зустрінеться з «Металістом» http://t.co/1twA7WTdHj

Вот уже и таджики распродают потихоньку территорию Российской Империи http://t.co/zbjPe1ILZm http://t.co/dKEyqnBZB1

Почему в Грозном нельзя играть в футбол: http://t.co/OL4flpoBCb

Pelegrini: Ne preuzimam Siti http://t.co/7tVuvnu0ow via @Mondoportal Kvota 1.05 da malo lagi... #StanJames

RT @art19maroc: PSG : le trophée est dans la poche http://t.co/lSRDbL5XRi

RT @yukio20686515: 実は反省なし? の民主党「大反省会」 - MSN産経ニュース http://t.co/85adbCMMEHちょと見たけど、反省してる姿よりも、人間関係がはっきり見えた。敵視、口を濁す。何のために、でてきたんだろ。。。。。。。。。。。

Facebook Starts Home Improvements After App’s Lukewarm Reception http://t.co/ifTHiI4k0U

「広域処理 ずさん交付金 がれき以外に9割支出」2013-05-13  (TOKYO Web) http://t.co/u9FxYnyT3A


http://t.co/lQ6PuEBmBO... http://t.co/UtsTaVZ3at

Leonardo vreest na beenbreuk voor einde carrière #NUnl http://t.co/Dp7LbLvfk5

RT @daddy_san: Wonderful profile of F1's reluctant rockstar Kimi Raikkonen, why he hates interviews and a possible move to Red Bull. http:/…

Tiger Woods hangs on to win The Players Championship http://t.co/RdHTAG7pXC

RT @amrhamdon: أهالى البالك بلوك يتوجهون لطرة إلنهاء إجراءات اإلفراج عنهم | أخبار الموجز http://t.co/JxU88vYjSF… via @Almogaz #مصر #بالك_بل…

Läser http://t.co/fPapW7Fjjm Haha, ja än är inte undrens tid förbi!

#Germany, #Israel and #Belgium vie for UNSC Western European and Others seat for 2019: details  http://t.co/buIeQ9dEel

RT @MaritzVB: #SocialMedia negatively affecting our lives? Psychologists say heart brakes are worse in the digital age http://t.co/ogSVUnrM…

Michael Gove accuses young people of being lazily reductive, uses lazily reductive sources to back it up: http://t.co/gkPksTORfP

RT @EducationLabour: Michael Gove announces his new centre for evidence based education policy  http://t.co/rbsrduZ6b9

RT @hanayuu: @tokaiama 【アメとムチ】 自民党が原発再稼働と地域雇用創出と北陸新幹線敦賀延伸を「抱き合わせ」で提示

http://t.co/gFW4Igwykl


Conclusion and perspectives

• Extendable to Markov of any order

• Convergence of Markov to natural language

• Common factors to k texts (k>2)

• Open problem: variance of joint complexity

– Related to variance of suffix tree size problem


Thank you

• QUESTIONS…