detecting patterns in north korean military provocations ...idse.or.kr/file/whang_1.pdf · that...

28
Detecting patterns in North Korean military provocations: what machine-learning tells us Taehee Whang 1 , Michael Lammbrau 2 and Hyung-min Joo 3 1 Department of Political Science & International Studies, Seoul, South Korea; 2 Department of Intelligence Studies, Mercyhurst University, Erie, USA; 3 Department of Political Science & International Relations, Seoul, South Korea Email: [email protected] Accepted 11 September 2016 Abstract For the past two decades, North Korea has made a series of military provocations, destabilizing the regional security of East Asia. In particu- lar, Pyongyang has launched several conventional attacks on South Korea. Although these attacks seem unpredictable and random, we at- tempt in this article to find some patterns in North Korean provoca- tions. To this end, we employ a machine-learning technique to analyze news articles of the Korean Central News Agency (KCNA) from 1997 to 2013. Based on five key words (‘years,’ ‘signed,’ ‘assembly,’ ‘June,’ and ‘Japanese’), our model identifies North Korean provocations with 82% accuracy. Further investigation into these attack words and the contexts in which they appear produces significant insights into the ways in which we can detect North Korean provocations. International Relations of the Asia-Pacific Vol. 0 No. 0 # The Author 2016. Published by Oxford University Press in association with the Japan Association of International Relations; all rights reserved. For permissions, please email: [email protected] International Relations of the Asia-Pacific Volume 0, (2016) 1–28 doi: 10.1093/irap/lcw016 International Relations of the Asia-Pacific Advance Access published October 19, 2016 by guest on October 23, 2016 http://irap.oxfordjournals.org/ Downloaded from

Upload: others

Post on 26-Aug-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Detecting patterns in North Korean military provocations ...idse.or.kr/file/Whang_1.pdf · that North Korea may have because of increasing power asymmetry since the collapse of the

Detecting patterns in NorthKorean military provocationswhat machine-learningtells usTaehee Whang

1

Michael Lammbrau2

andHyung-min Joo

3

1Department of Political Science amp International StudiesSeoul South Korea 2Department of Intelligence StudiesMercyhurst University Erie USA 3Department of PoliticalScience amp International Relations Seoul South KoreaEmail hjookoreaackr

Accepted 11 September 2016

Abstract

For the past two decades North Korea has made a series of military

provocations destabilizing the regional security of East Asia In particu-

lar Pyongyang has launched several conventional attacks on South

Korea Although these attacks seem unpredictable and random we at-

tempt in this article to find some patterns in North Korean provoca-

tions To this end we employ a machine-learning technique to analyze

news articles of the Korean Central News Agency (KCNA) from 1997 to

2013 Based on five key words (lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquo lsquoJunersquo and

lsquoJapanesersquo) our model identifies North Korean provocations with 82

accuracy Further investigation into these attack words and the contexts

in which they appear produces significant insights into the ways in

which we can detect North Korean provocations

International Relations of the Asia-Pacific Vol 0 No 0 The Author 2016 Published by Oxford University Press in association with the

Japan Association of International Relations all rights reservedFor permissions please email journalspermissionsoupcom

International Relations of the Asia-Pacific Volume 0 (2016) 1ndash28doi 101093iraplcw016

International Relations of the Asia-Pacific Advance Access published October 19 2016 by guest on O

ctober 23 2016httpirapoxfordjournalsorg

Dow

nloaded from

1 Introduction

Since the conclusion of the Korean War (1950ndash53) there has been sig-nificant violence along the Demilitarized Zone (DMZ) In additionthere have been multiple military clashes at sea and elsewhere alongthe coast of South Korea What is the main motivation of Pyongyangto initiate these attacks Is it possible to analyze North Korean mili-tary provocations systematically In answering these questions a pri-mary obstacle has been the relative dearth of information regarding theintentions of Pyongyang As one put it North Korea has been lsquothelongest running intelligence failurersquo (Litwak 2007) or lsquoNorth Koreacould have been on Mars for [the outside world] knew about it It wasa faraway land of unknowns and unknowables explored mostly byspace probes and in this case spy satellitesrsquo (Sigal 1998)

Existing theories of IR fail to provide an adequate guide to under-stand North Korea military provocations For instance considerrealism According to Mearsheimer (2001) three factors affect powercalculations of a country possession of nuclear forces separation bylarge bodies of water and a power distribution Among them the firsttwo cannot explain variations in North Korean military provocationsbecause Pyongyang has the upper hand in nuclear capability againstSeoul and a territorial proximity between the two Koreas is a constantBy contrast the distribution of power can influence the extent of fearthat North Korea may have because of increasing power asymmetrysince the collapse of the Soviet bloc What is unclear however is thelevel of resolve North Korean has to initiate military attacks Withoutdata to estimate how willing Pyongyang is to use force realism is lim-ited in explaining how profound North Korean fears are and hencewhy Pyongyang resorts to violent military attacks occasionally

To cope with the paucity of reliable information two lines of re-search have emerged in North Korean studies a surveyinterview ofNorth Korean refugees and an analysis of North Korean newspapersAbout the latter the Korea Central News Agency (KCNA) has pub-lished government-approved articles since 1996 In recent years severalscholars have focused on the KCNA in hopes of distilling useful insightfrom it (McEachern 2010 Rich 2012ab Joo 2014) Based on count-ing the frequency with which particular terms or phrases appear inKCNA news articles these works have yielded some interesting findings

2 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

about the linguistic features of KCNA articles and their correlationwith nuclear policies political rhetoric economic trends and socialchanges of North Korea

By contrast this article embarks on a new approach a text-classification approach based on a supervised machine-learning tech-nique In particular we develop a model that can distinguish the periodof imminent North Korean provocations from peace time by using theKCNA as our data and supervised machine-learning as our methodFor our cases we select all five North Korean attacks between 1997and 2013 (i) the First Battle of Yonpyong on 15 June 1999 (ii) theSecond Battle of Yonpyong on 29 June 2002 (iii) the Battle ofDaechong on 10 November 2009 (iv) the sinking of the Cheonan navalship on 26 March 2010 (v) the shelling at Yonpyong Island on 23November 2010 Each of these incidents is a conventional NorthKorean attack resulting in more than one casualty on one or bothsides

Our analysis shows that immediately prior to North Korean at-tacks Pyongyang tends to increase its use of five key terms in theKCNA lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquo lsquoJunersquo and lsquoJapanesersquo Our in-vestigation into the contexts in which they appeared in KCNA arti-cles shows that Pyongyang often employs terms like lsquoJunersquo lsquoyearsrsquoand lsquoJapanesersquo to nostalgically invoke past battles of Kim Il-sungagainst Japanese colonialism Moreover the term lsquosignedrsquo indicatesthat the KCNA quotes lsquoofficial commentariesrsquo published in RodongSinmun the mouthpiece of the ruling Workersrsquo Party of Korea(WPK) right before an attack Finally a social network analysis ofkey terms shows that the word lsquoassemblyrsquo refers to the SupremePeoplersquos Assembly (SPA) The high correlation of the term lsquoassem-blyrsquo with military attacks allows us to conjecture that provocationsare often premeditated insofar as they are preceded by increasedSPA activities These findings provide us with a basis for further re-search into North Korean military provocations By differentiatingthreat articles from non-threat items our model can serve as a use-ful indicator for imminent North Korean aggressions

Detecting patterns in North Korean military provocations 3

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

11 Logistics data and cases

111 Data KCNA

As the sole news agency of North Korea the KCNA provides daily re-ports of North Korean newspapers (eg Rodong Sinmun of the rulingWPK and Minju Chosun of the North Korean government) televisionbroadcasts (eg Korean Central Television) and radio broadcasts (egthe Korean Central Broadcasting System) Since 1996 the KCNA haspublished daily news articles via its server in Japan (wwwkcnacojp)On a typical day the KCNA publishes 20ndash40 articles including reportson activities of the ruling North Korean elite official statements of theNorth Korean government (eg an official statement from the Ministryof Foreign Affairs on nuclear issues) several articles selected fromRodong Sinmun and Minju Chosun miscellaneous news about NorthKorean society and reports of recent developments in foreigncountries

Scholars working on North Korea have relied on its two mediasRodong Sinmun and the KCNA As the newspaper of the WPKRodong Sinmun is regarded as the official mouthpiece of the NorthKorean regime As a result it has become popular for scholars espe-cially in South Korea to conduct content analyses of Rodong Sinmunin order to identify trends or policy shifts of the North Korean govern-ment (Koh 2013) From our viewpoint however Rodong Sinmun hastwo weaknesses First it is inappropriate for our project because its pri-mary target is the domestic audience of North Korea (thus publishedonly in Korean) Given that Rodong Sinmun is written for a domesticaudience it is not a proper place to look for signs patterns or indica-tors of Pyongyang that its relations with the outside world (especiallythe US and South Korea) are at a breaking point and that a militaryconflict of some sort is about to occur Instead the KCNA with its fo-cus on foreign audiences (thus published in four different languages ndashEnglish Spanish Russian and Korean) provides a better source to de-tect such signs or patterns Second the KCNA provides a better data-set from a technical viewpoint Although North Korea has madeRodong Sinmun available on internet (wwwkcnacojptoday-rodongrodonghtm) in recent years only few selected articles after 2002 are

4 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

available while only titles are provided for the rest of Rodong Sinmunarticles By contrast all the articles in the KCNA after 1996 are avail-able on the internet thus providing better data for our machine-basedtext-classification analysis to detect any signs patterns or indicatorsfrom Pyongyang that a military strike is likely to occur As a resultthe KCNA is used as the main source of data

Although our case selection of conventional North Korean provoca-tions begins in 1997 because KCNA data is available after that year itis more than a technical convenience to use the year 1997 as a startingpoint It also overlaps with the succession process from Kim Il-sung toKim Jong-il When Kim Il-sung passed away on 8 July 1994 KimJong-il took the traditional three-year mourning period (1994ndash96) be-fore he assumed the official title of General Secretary of WPK in 1997to rule the country As a result the year 1997 serves as a starting pointin our project not only for a technical reason (ie the availability of theKCNA dataset after 1996) but also for a substantive reason (ie it in-cluded North Korean military provocations in the post-Kim Il-sungera) In particular Pyongyang has launched five conventional militarystrikes in the post-Kim Il-sung period

12 Cases five conventional military crises since 1997

Figure 1 shows five North Korean conventional military attacks be-tween 1996 and 2013 (i) the First Battle of Yonpyong on 15 June1999 (ii) the Second Battle of Yonpyong on 29 June 2002 (iii) theBattle of Daechong on 10 November 2009 (iv) the sinking of theCheonan naval ship on 26 March 2010 and(v) the shelling atYonpyong Island on 23 November 2010 For our case selection threecriteria are used First each of these attacks caused one or more casu-alties on at least one side Second all five attacks used conventionalnon-nuclear weapons1 Third all five attacks were initiated by North

1 In this article we have excluded all events associated with the North Korean nuclear crisisIn a separate project however we employ a machine-learning technique to develop to de-tect significant signs or patterns of Pyongyang that it is about to conduct a nuclear testPreliminary research shows an interesting contrast between conventional provocations andnuclear crisis in the North Korean case As will be shown a single platform (covering theentire period from Kim Jong-Il to the Kim Jong-Un) outperforms a double platform (onemodel for the KJI period and another model for the KJU era) in the case of North Koreanconventional provocations Simply put there has not been a major policy shift in

Detecting patterns in North Korean military provocations 5

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Korea Below is a brief description of the five North Korean militaryprovocations

121 The first battle of Yonpyong15June1999

In 1999 Pyongyang claimed that South Korea trespassed the NorthernLimit Line (NLL) in the Yellow Sea On 7 June 1999 when NorthKorean patrol ships and fishing boats crossed the NLL the SouthKorean navy responded by increasing its patrol After a few days ofcontinual trespassing the South Korean government issued a warningand deployed two patrol corvettes (Hong 2012) When Pyongyangcontinued to ignore warnings the South Korean navy blocked NorthKorean boats ramming them (Macfie 2013) After a few skirmishesthe main battle occurred on 15 June 1999 when four North Korean pa-trol ships trespassed across the NLL soon joined by three NorthKorean torpedo boats and three additional patrol ships With a totalof ten battleships North Koreans launched a 25 mm cannon shell

Figure 1 Five conventional North Korean military attacks

Pyongyang as far as its conventional provocations are concerned By contrast a doubleplatform outperforms a single platform in the case of the North Korean nuclear crisis indi-cating a major policy shift from Kim Jong-Il to Kim Jong-Un At the moment we are try-ing to find out whether such a shift indicates an increasing intention of the new NorthKorean leadership under Kim Jong-Un to proclaim a lsquonuclear powerrsquo status by developingits nuclear program further instead of negotiating over it as his father had done on severaloccasions

6 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

(Park 2009) In response the South Korean navy fired with 40 mmand 76 mm machine guns When the battle was over a North Koreantorpedo boat sank a large patrol ship crashed and four patrol shipssustained damage In the process approximately 30 North Korean sol-diers were killed and 70 were wounded As for South Korea four pa-trol killers and one patrol corvette were damaged with nine soldierswounded (Moore and Hutchison 2010)

122 The second battle of Yonpyong 29 June 2002

By 2002 North Korean ships frequently crossed the NLL only to bechased back by South Korean patrol vessels On 29 June 2002 twoNorth Korean patrol ships crossed the border ignoring warnings fromSouth Korean navy speedboats At 1025 am the two North Koreanpatrol boats attacked nearby South Korean vessels with its 85 mmgun 25 mm auxiliary gun and hand carried rockets In response theSouth Korean patrol ships returned fire (Sohn 2002) A 20-minute bat-tle resulted in the death of four South Korean marines one missingand 18 wounded On the North Korean side approximately 30 sailorswere killed or injured While South Korean vessels were partly dam-aged one of the North Korean vessels was towed away in flames(Global Security 2002)

123 The battle of daechong 10 November 2009

On 10 November 2009 a North Korean patrol vessel trespassed theNLL Soon after two 130-ton South Korean vessels issued warningsbut the North Korean vessel ignored them When the South Koreanships fired warning shots the North Korean vessel began firing leadingto a 2-minute battle near Daechong Island located 18 miles off theNorth Korean coast The North Korean patrol vessel also attacked aSouth Korean high-speed patrol vessel (Kim 2009) In response theSouth Korean vessel countered with approximately 200 shots Whenthe battle was over there were no South Korean casualties but NorthKorea suffered one casualty and three injuries with its naval vessellsquowrapped in flamesrsquo (Choe 2009)

Detecting patterns in North Korean military provocations 7

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

124 Sinking of the cheonan naval ship 26 March 2010

On 26 March 2010 the South Korean naval ship Cheonan sank intothe Yellow Sea At 922 pm an explosion occurred in the 1200-tonwarship that was sailing by Baengnyong Island where the two Koreashad clashed numerous times before (Cha 2010) In the end 46 liveswere lost and it was quickly suspected that the ship lsquohad been hit byan external lsquonon-contactrsquo explosionrsquo with North Korea as the primesuspect (Sudworth 2010) Pyongyang denied its involvement andclaimed that the incident had been contrived by South Korea A six-week-long investigation by international experts proved the involve-ment of North Korea in the attack

125 Shelling of Yonpyong island 23 November 2010

On 23 November 2010 North Korea fired artillery shells at YonpyongIsland near the NLL About 200 shells hit the island and set fire todozens of buildings The barrage killed two South Korean citizens andtwo marines while injuring three civilians and 17 soldiers The attackbegan when South Korea was practicing military drills near the NLLdespite North Korean warnings North Korea fired three separate bar-rages with dozens of artillery shells in each barrage In return SouthKorea responded by firing 80 rounds from K9 155 mm self-propelledHowitzers (Kim and Kim 2011) When the battle was over more than50 civilian homes were in flames

13 Research method supervised machine learning

Although the North Korean nuclear crisis has been the focus of the in-ternational community in recent years the history of North Koreanmilitary provocations dates much further back For instancePyongyang destabilized an already precarious security environment inthe Korean Peninsula by launching a series of provocations such asthe hijacking of a South Korean airline (1958) the Korean DMZConflict known as lsquothe Second Korean Warrsquo with more than 700 casu-alties (1969) the hijacking of the USS Pueblo (1968) the notoriousAxe Murder Incident (1976) the bombing of the Korean Air Line 858(1987) and so on

8 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Not surprisingly many scholars have attempted to analyze NorthKorean military provocations trying to identify their lsquocausesrsquo (Jung2013 Lee 2014 Ko 2015) main lsquogoalsrsquo in those attacks (Jung 2008Kang 2013) proper lsquopoliciesrsquo to curtail further threats (Oh 2011Kim 2012) and so on Despite sincere efforts earlier studies sufferedfrom one notorious problem a lack of reliable data As one put itNorth Korea has been lsquothe blackest of black holesrsquo (Litwak 2007289) Under such circumstances scholars had little choice but to makeeducated guesses ndash lsquoguesstimationsrsquo ndash regarding unknown intentionsgoals or likely moves of the North Korean regime As a result the ex-isting literature on North Korean military provocations has beendriven less by reliable data but more by subjective interpretations

By contrast we rely on the official North Korean media KCNA Asfor the method we use a supervised machine-learning technology tomaximize the use of the KCNA dataset that covers various aspects ofNorth Korea from 1997 to 2013 The main advantage of our methodcomes from the quality of the data that is our findings are data-driven and thus more objective than previous works relying on subjec-tive interpretation of selected observations Given the paucity ofreliable information on North Korea it is important to analyze thecontent of official KCNA articles that are publicly available A super-vised machine-learning technology is the optimal method to processsuch data objectively

Supervised machine-learning consists of three steps(i) data collec-tion and document labelling (ii) pre-processing and (iii) model extrac-tion and analysis2 First we obtained our dataset from the KCNAwebsite (httpwwwkcnacojp) Since there are five North Korean mili-tary attacks in our case we pulled 10 tranches of articles from all theKCNA articles published from 1997 to 2013 We then labelled five ofthese tranches lsquothreatrsquo while labelling the other five as lsquonon-threatrsquo All

2 We tried a variety of machine learning algorithms such as Random Forests SupportVector Machines and Conditional Inference Trees Also we cross-validated their resultsOur findings demonstrate that these algorithms produced similar results Moreover all ofthese algorithms identified the similar pattern-detecting terms for classifying the KCNA ar-ticles In this article we reported the results based on the Conditional Inference Tree algo-rithm which ran the tree-structured regression models through binary recursivepartitioning in a conditional inference framework For more details on the ConditionalInference Tree algorithm please see the R Package lsquoPartyrsquo for Conditional Inference Trees(lsquoctreersquo) at httpcranr-projectorgwebpackagespartypartypdf

Detecting patterns in North Korean military provocations 9

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

the articles published within a week preceding each attack is labelled aslsquothreatrsquo and the rest of KCNA articles were treated as lsquonon-threatrsquo Intotal there were 1624 KCNA articles published during this periodwith 487 labelled as lsquothreatrsquo and 1137 labelled as lsquonon-threatrsquoFor the threat articles we extracted all the articles published within aweek of each North Korean military provocation (Fig 2) For the non-threat articles we selected tranches of news articles for a randomlychosen 10-day period from at least two months before or after a NorthKorean attack In so doing our goal was to capture articles that wereunrelated to a North Korean attack thus labelled as lsquonon-threatrsquo

Second the selected dataset is then lsquopreprocessedrsquo Since the originalarticles are not ready for an automated text analysis a cleaning processcalled lsquopreprocessingrsquo is necessary to prepare them for machine-learning Pre-processing is the standard procedure before the applica-tion of an automated text analysis Following the standard preprocess-ing procedure we remove all numbers and stopwords (eg a the into etc)3 We then transformed the body of the text into a documentterm matrix that was composed of rows of news articles followed bycolumns of terms The document term matrix turned preprocessed texts

Non-Threat

-2 +2

Acl

All Threacollected eading up

All Nowere ptwo mo

Threat

Provocation

at-labeledfrom the p to a pro

on-Threatpulled twoonths afte

d articles wseven day

ovocation

t-labeled o months er a provo

ere ays

articles before or

ocation

Non-

r

Threat

Time

Figure 2 Labelling threat and non-threat articles

3 In this article we use one of the most popular preprocessing methods known as the lsquobag ofwordsrsquo concept (Jurafsky and Martin 2009) It discards the use of word order as a factorand removes the so-called lsquostopwordsrsquo from the data The stopwords include punctuationcapitalization common words (eg at an the etc) and numbers For a full list of thestopwords used in our research please visit httpjmlrcsailmitedupapersvolume5lewis04aa11-smart-stop-listenglishstop Although preprocessing reduces the amount of infor-mation it has been shown by researchers that a simplification of text via preprocessing issufficient to infer valuable models (Hopkins and King 2010)

10 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

into quantifiable values (eg term counts and frequencies) for each cor-responding article

Finally once we had collected and preprocessed the KCNA data wesplit the whole data into two subsets a training dataset and a testdataset From the complete set of articles from both threat and non-threat tranches (1624 articles in total) we randomly selected 70(1137 articles) for the training dataset (Fig 3) The labels lsquothreatrsquo orlsquonon-threatrsquo for each article were included in the training dataset forthe purpose of automated machine-learning that is to develop amodel that can select pattern-detecting features based on a priori clas-sifications Basically the key pattern was frequency of occurrence Thefrequency with which certain words and short phrases appeared inthreat or non-threat articles determined how they were selected andweighted as pattern-detecting features in the model The trainedmodel was then applied to the remaining test dataset which was usedsolely for testing the accuracy of our model Importantly we removedall the threat and non-threat labels from each article in the testdataset The purpose for doing so was to challenge our model to usewhat it had learned (from the training dataset) about significant fea-tures (ie a term frequency rate) to accurately classify KCNA articles(in the test dataset) as either a threat or a non-threat without relyingon labels4

All Data 1624 Articles

1137 Articles 487

Training DataThreat Labels Provided

Test DataThreat Labels Removed

Figure 3 Training and test data sets

4 For replication our data is available at httpsdataverseharvardedudatasetxhtmlpersistentIdfrac14doi3A1079102FDVN2FB8CWWD

Detecting patterns in North Korean military provocations 11

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

2 Results

21 Training data results

Figure 4 shows the model we obtained from the training dataset usinga supervised machine-learning The key terms distinguishing NorthKorean threats from a peaceful period were lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquolsquoJunersquo and lsquoJapanesersquo When certain conditions were satisfied (as ex-plained below) the appearance of these terms in KCNA articles couldplay the role of a pattern-indicator that a North Korean military threatwas on the horizon

According to our model the first and strongest indicator of a NorthKorean military provocation is the appearance of the term lsquoyearsrsquo Inthe training dataset all KCNA articles in which the word lsquoyearsrsquo ap-peared at least once (53 articles in total) were published within a weekof a North Korean military strike From our viewpoint the most reli-able sign of an imminent North Korean military provocation is a sud-den increase in the frequency of the term lsquoyearsrsquo in KCNA articles

When the term lsquoyearsrsquo is absent the second best indicator of a NorthKorean military attack is the word lsquosignedrsquo Approximately 85 ofKCNA articles (51 articles in total) that had the term lsquosignedrsquo withoutthe word lsquobakrsquo (from former South Korean president Lee Myung-bak)

Figure 4 First conditional inference tree

12 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

were published within a week of a North Korea military provocationAs a result a sudden spike in the use of the word lsquosignedrsquo (withoutlsquobakrsquo) indicates that a North Korean military provocation is around thecorner In this respect however there is a further condition that mustbe satisfied When lsquosignedrsquo and lsquobakrsquo appear together in the same arti-cle they indicate a pattern of a peaceful situation instead As a resultthe pattern-detecting role of the term lsquosignedrsquo appears to be contingentupon the presence or absence of another term lsquobakrsquo

According to our model the third index indicating that Pyongyangmay be preparing for a military operation is the term lsquoassemblyrsquo Infact all the KCNA articles that included the word lsquoassemblyrsquo withoutlsquoyearsrsquo or lsquosignedrsquo (22 articles in the training dataset) were publishedwithin one week of a North Korean military attack As a result a sud-den increase in the frequency of the word lsquoassemblyrsquo may be a goodpattern indicator that a North Korean military strike is on the horizoneven in the absence of stronger threat terms such as lsquoyearsrsquo andlsquosignedrsquo

The fourth indicator of a North Korean military threat is the termlsquoJunersquo Unlike other indicators however the role of lsquoJunersquo is condi-tional In the absence of other stronger signs (ie years signed and as-sembly) the term lsquoJunersquo can play the role of a significant indicator of aNorth Korean military attack only if another term lsquoreunificationrsquo ap-pears once or less in the same article To our surprise however theterm lsquoJunersquo turns into a strong indicator of a non-threat situation if theword lsquoreunificationrsquo appears more than twice in the same article As aresult it seems that the word lsquoJunersquo implies an increasing NorthKorean threat only if it is not closely associated with another key pat-tern indicator lsquoreunificationrsquo

The last index of a North Korean military provocation is the wordlsquoJapanesersquo When other (and stronger) terms such as lsquoyearsrsquo lsquosignedrsquolsquoassemblyrsquo and lsquoJunersquo were not present all the KCNA articles that in-cluded the term lsquoJapanesersquo (17 articles in the training dataset) appearedwithin one week of a North Korean military provocation As a resultthe use of the term lsquoJapanesersquo serves as a good pattern indication thatin the absence of other attack words Pyongyang is moving dangerouslyclose to a military strike

Detecting patterns in North Korean military provocations 13

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

22 Testing the model

There are 904 KCNA articles that do not include any of the indicatorsdiscussed above Because they do not contain any pattern indicator ofa North Korean attack our model identifies them as non-threats Inreality however about 20 of these articles were published within oneweek of the five military provocations by Pyongyang As a result ourmodel has roughly 80 accuracy in identifying North Korean militarythreats What is encouraging is that our model has identified five keypattern indicators of increasing North Korean threats When these keyterms are put together under certain conditions they correctly identify80 of articles in the training dataset as threat articles or non-threatitems Put differently the model can accurately classify in 8 of 10 caseswhether various messages from Pyongyang are real threats or justrhetoric

Although 80 accuracy is impressive it is too early to be optimisticAfter all 80 accuracy was achieved with the training dataset fromwhich our model was originally derived If we were impressed by its80 accuracy rate we would resemble a case study specialist who de-veloped an elaborate theory from a few cases and then applied it backto those original cases only to be impressed by how accurate hishertheory was The real test consists of applying our model to cases otherthan those from which it is derived This is the reason why we dividedthe entire KCNA data into two subsets the training dataset fromwhich our model is developed and the test dataset to which it will beapplied as a real test

Table 1 shows the result of our model against the test datasetNumbers in the cells indicate whether there was agreement or discrep-ancy between model predictions and actual cases For example our

Table 1 Model accuracy against test dataset

ActualModel Threat Non-threat

Threat 74 13 Positive predictive value 085

Non-threat 73 327 Negative predictive value 082

Sensitivity Specificity Overall accuracy

050 096 082

14 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

model classified 327 cases (the lower right cell) as non-threat and thosearticles were in fact published during non-threat weeks In additionthe model classified 74 articles (the upper left cell) as threats and thosearticles were actually published during threat weeks When we applyour model to the test dataset its overall accuracy turns out 823roughly the same level we obtained with the training dataset Althoughour model is deduced from the training dataset it does not lose cate-gorizing capacity when applied to the test dataset Specifically of 487articles in the test dataset our model correctly classified 401 articles(823) as either threats or non-threats The model is very effective inidentifying harmless rhetoric from Pyongyang that is messages not re-sulting in actual military attacks When applied to a total of 340 non-threat articles in the test dataset our model successfully classified 327items as noise with only 13 misses (ie 967 accuracy) As a result itis very effective at filtering noise from Pyongyang By contrast thepattern-detecting accuracy of our model drops somewhat with respectto threats Of 147 threat articles in the test dataset it correctly identi-fies 74 items as threats while misrecognizing 73 threats as peaceful arti-cles (ie 50 accuracy)

3 Discussion

What does the model tell us in plain terms In particular what are themeanings of the key indicators for a North Korean military threat (ieyears signed assembly June and Japanese) To answer it is necessaryto go back to the original KCNA documents in order to understandthe contexts in which these terms were used A close reading of theKCNA articles that included the five key indicators or lsquoattack wordsrsquoreveals several interesting patterns

First the North Korean regime tends to emphasize its history ofmilitary struggle against foreign enemies before it launches armed prov-ocations It is then understandable that key terms such as lsquoyearsrsquo andlsquoJapanesersquo often appear in KCNA articles immediately before a militarystrike Prior to the second naval clash of Yonpyong on 29 June 2002for instance the North Korean government published a series of arti-cles in which it boasted of the lsquoyearsrsquo of North Korearsquos victoriousstruggles against foreign enemies including lsquoJapanesersquo colonialism(1910ndash45) It is possible that Pyongyang invoked the military legacy of

Detecting patterns in North Korean military provocations 15

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

its supreme leader Kim Il-sung in anticipation of an impending con-flict either to boost the morale of its people or to show the outsideworld that it was determined not to back down

Second the most perplexing term in Figure 4 is lsquoJunersquo because itcan be indicative of both threat and peace Specifically the word lsquoJunersquois an indicator of peace when associated with lsquoreunificationrsquo but it canalso be a sign of a military threat when it does not appear in conjunc-tion with lsquoreunificationrsquo A careful reading of the KCNA articles thatinclude the term lsquoJunersquo suggests an explanation for this strange phe-nomenon It turns out that the word lsquoJunersquo can refer to either theKorean War (which broke out on 25 June 1950) or the historic summitbetween Kim Dae-jung and Kim Il-sung on 15 June 2000 (which waspraised as a major cornerstone for future lsquoreunificationrsquo in the officialNorth Korean press) As a result when the word lsquoJunersquo appears inclose association with lsquoreunificationrsquo it refers to the 2000 summit (orlsquothe June 15 Summitrsquo as it is called in North Korea) thus indicatingthe peaceful intentions of Pyongyang By contrast when the wordlsquoJunersquo appears alone without connection to lsquoreunificationrsquo it refers tothe Korean War thus conveying a more hostile mood

Third it is also interesting to note that the North Korean govern-ment tends to publish many lsquosignedrsquo commentaries from RodongSinmun in the KCNA immediately before it launches military provoca-tions In these lsquosignedrsquo commentaries of Rodong Sinmun Pyongyangusually criticizes foreign countries (especially the United States Japanand South Korea) in very harsh terms on various issues Accordinglythe third term lsquosignedrsquo seems to correspond to the fact that RodongSinmun the official mouthpiece of the North Korean regime barksvery loudly before it actually bites its intended target A sudden in-crease in the number of lsquosignedrsquo commentaries of Rodong Sinmun inKCNA articles appears to be a reliable sign that the North Koreangovernment may be turning to an attack mode

Finally the last indicator of an imminent threat lsquoassemblyrsquo is theeasiest term to identify but the hardest one to interpret As shown inFigure 5 the term lsquoassemblyrsquo is primarily used in reference to the SPABecause the SPA is the highest North Korean authority with regard toits relations with foreign countries it is tempting to interpret a suddenincrease in references to the SPA prior to military provocations as indi-cating that Pyongyang is attempting to openly signal its hostile

16 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

intentions to the outside world In other words belligerent messagesemanating from the SPA (and reported in KCNA articles) may revealthe determination of the North Korean regime

Although plausible the problem with such an interpretation is thata close reading of several KCNA articles using the term lsquoassemblyrsquoshows that messages from the SPA are often not dire at all For exam-ple consider the following article published on 17 November 2010 justdays before the North Korean shelling at Yonpyong Island lsquoKim YongNam President of the Presidium of the DPRK SPA sent a message ofgreetings to Qaboos Bin Said Sultan of Oman on Wednesday on theoccasion of its national holidayrsquo (KCNA 17 November 2010) As thisexample illustrates a typical KCNA article with the word lsquoassemblyrsquodescribes rather routine business of the SPA such as how it sent a mes-sage of congratulations to a foreign country how it greeted visitingdignitaries from foreign countries and so on Clearly such mundanemessages cannot be a meaningful sign of an imminent North Koreanmilitary provocation While the publication of these articles (containingthe term lsquoassemblyrsquo) might possibly correspond to an uptick inPyongyangrsquos diplomatic efforts to strengthen its ties with foreign coun-tries and increase international support prior to a military attack weneed further analyses to substantiate such a hypothesis At the sametime it also seems clear from the test that the term lsquoassemblyrsquo does not

Figure 5 Social network analysis for key words in KCNA articles

Detecting patterns in North Korean military provocations 17

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

appear randomly and is somehow linked to the timing of NorthKorean military provocations Further research is necessary to make aproper interpretation of this seemingly irrelevant yet apparently signifi-cant indicator of North Korean military threats

31 Robustness checks

Before we conclude it is necessary to address five issues regarding therobustness of our findings (i) the rare-event issue (that is due to therare nature of a North Korean military attack our sampling ratio of710 should be justified) (ii) an out-of-sample alternative (that is in-stead of building our model by using 70 of data from the five NorthKorean military strikes and then applying it to the remaining 30 ofthe data it may be better to develop a model from the first threestrikes and then apply it to the remaining two North Korean provoca-tions) and (iii) a North Korean leadership change effect (that is in-stead of elaborating a single model covering 1997 to 2013 it may makesense to develop two separate models [one for the Kim Jong-il periodand the other for the Kim Jong-un era] in order to check whether thereis a leadership change effect) (iv) a South Korean leadership changeeffect (that is it may be better to elaborate two different models [onefor the lsquoconservativersquo period during Lee Myung-bak and Park Geun-hye and the other for the lsquoprogressiversquo era during Kim Dae-jung andRoh Moo-hyun] in order to investigate whether North Korea re-sponded differently to leadership changes in South Korea and (v) ano-Cheonan alternative (that is it is worth elaborating a model whileexcluding the Cheonan naval ship case because unlike the remainingfour attacks Pyongyang has denied its involvement in sinking theCheonan In this section these five issues are addressed to check therobustness of our model

First there is a discrepancy in the ratio of threat days vs non-threatdays between the data used in our analysis and the actual frequencyAs explained earlier we have defined the threat articles as those pub-lished one week prior to each actual crisis whereas the non-threatitems are defined as those chosen randomly for a 10-day period at leasttwo months before or after actual provocations For all five crises theratio of our data is then 3550 (in days) or 710 In reality however thenumber of peace days (ie 18 years minus 35 days) is much larger than

18 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

the number of crisis days (35 days) When we count all of the peacedays that are not included in our data (ie when we take all TrueNegatives into our data) the North Korean provocations become ex-tremely rare events As a result a possible criticism is that our ap-proach does not represent the true data generating process in a sam-pling procedure While military provocations occur rarely in realityour sampling procedure exaggerates their likelihood by significantly re-ducing the number of peace days in the data Put more technically ourmodel reduces the number of False Positives while increasing the num-ber of False Negatives

Despite the criticism our sampling strategy can be justified for tworeasons First the rare-event problem is not new in political scienceFor example King and Zeng (2001ab) addressed the same problem instatistical analysis (eg logit analysis with rare events) A commonlysuggested solution is the lsquochoice-based or endogenous stratified sam-plingrsquo in econometrics and the lsquocase-control designrsquo in epidemiology(Breslow 1996) The idea is to lsquoselect within categories of Yrsquo such thatall crisis days are sampled while we use a small random fraction ofnon-crisis days In this respect our sampling strategy is consistent withsuch an approach in that we have chosen all crisis days (35) and a ran-dom fraction of peace days (50) Not surprisingly such an approach iscommonly used in supervised machine-learning as well The rare-eventissue also known as a class imbalance problem is an ongoing issue insupervised machine-learning because the size of one class is often muchlarger than that of the other class (eg premature births violent civilconflicts fraudulent credit card transactions etc) In supervisedmachine-learning two solutions have been suggested a data samplingtechnique and an algorithmic modification technique Whereas datasampling is to under-sample the majority class while over-sampling theminority class a modification technique is to combine both over- andunder-sampling methods at the same time In this article we haveadopted a data sampling technique in order to address the class imbal-ance between threat days (a minority class) and non-threat days (a ma-jority class) in North Korean military provocations

Second although reducing the number of non-crisis days in ourdata (under-sampling) while using all the crisis days (over-sampling) isconsistent with both statistical and machine-learning literature therestill remains a question How far should we reduce the number of

Detecting patterns in North Korean military provocations 19

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

peace days Will a ratio of 12 13 or an even more skewed ratio gen-erate a better performance than our model using a 710 ratio To an-swer this question we pushed the ratio until we saw a clear sign of themodel becoming too skewed towards the majority class (peace days)In our case it occurred when the ratio between the two classes ex-ceeded 13 On this subject literature on machine-learning clearlyshows that the ratio of crisis and non-crisis should not be significantlyunbalanced According to Ertekin (2013) lsquo[i]n problems where theprevalence of classes is imbalanced it is necessary to prevent the resul-tant model from being skewed towards the majority (negative) classand to ensure that the model is capable of reflecting the true nature ofthe minority (positive) classrsquo Otherwise the generated model can sufferfrom an over-fitting problem For instance in the face of an extremelyskewed dataset (eg the real ratio of our data would be 35 crisis daysvs 6535 peace days) an automated machine-learning process naturallybecomes lsquogreedyrsquo that is in order to increase its overall accuracy it in-creasingly focuses on the larger category (6535 days) while virtually ig-noring the smaller category (35 days) Generating the model in thisway would be problematic however because our object is to focus onthe minority category of crisis days which is less than 05 of all daysfrom 1997 to 2013 After all we are trying to classify upcoming NorthKorean attacks not peaceful days

With this goal in mind we ran robustness checks to see if alternativemodels yield a better explanatory power when we increased the numberof peace days vis-a-vis threat days Because the maximum limit of anunbalanced ratio for automated machine-learning is 13 we attemptedboth 12 and 13 in our tests As Table 2 shows it is clear that the orig-inal model outperforms both alternatives Compared to the 12 ratiothe original model has more explanatory power in every way (iehigher overall accuracy higher threat accuracy higher non-threat accu-racy) By contrast the 13 ratio model has a slightly lower overall accu-racy marginally higher non-threat accuracy but devastatingly lowerthreat accuracy (175) than our original model because the increasingimbalance (from 710 to 13) turned the automated machine-learningprocess to a lsquogreedyrsquo mode As a result it is our conclusion that theoriginal model uses the golden ratio with the most accurate categoriz-ing capacity

20 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Second it is also necessary to check whether an out-of-sample ap-proach is a better alternative to our original model In our analysis wedeveloped a model by using 70 of data from the five North Koreanattacks and then applied it to the remaining 30 of data from Onemay suspect however that a better approach is to elaborate a modelfrom the first three North Korean strikes (ie during the Kim Jong-ilperiod) and then apply it to the remaining two attacks (ie during theKim Jong-un era) in order to see whether it yields more explanatorypower As Table 2 shows the opposite turns out to be the case In factthe out-of-sampling model loses explanatory power in every possibleway (ie lower overall accuracy lower threat accuracy and lower non-threat accuracy) As a result our original model outperforms the out-of-sampling alternative

Third it is necessary to check whether or not there is a leadershipchange effect in North Korea Has the change of power from KimJong-il to Kim Jong-un produced such significant differences in theirleadership style that it may be worth developing two separate models(one for the Kim Jong-il period the other for the Kim Jong-un era)instead of a single model covering the entire period as we did As

Table 2 Comparison with alternative models

ModelsOverall accuracy (N)

Threat accuracy(Sensitivity)

Non-threataccuracy (Specificity)

Rare event I 724 328 823

12 Ratio (368508) (66201) (302367)

Rare event II 788 175 99

13 Ratio (656832) (36206) (620626)

Out of sampling 541 145 821

KJI KJU (6471195) (72495) (575700)

NK leadership KJI 685 (98143) 459 (2861) 793 (6582)

KJI vs KJU KJU 595 (195328) 612 (93152) 58 (102176)

SK leadership Con 614 (221360) 363 (58160) 815 (163200)

Con vs Prog Prog 681 (98144) 493 (3571) 863 (6373)

No cheonan 669 514 787

Case (273408) 92178 181230

Our Model 823 503 961

(401487) (74147) (327340)

Detecting patterns in North Korean military provocations 21

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Table 2 shows the double platform alternative yields a worse outcomeoverall In fact its only advantage is that the Kim-Jong-un-model hasslightly higher threat accuracy (612) than our original model(503) In every other aspect however the Kim-Jong-un-model per-forms much worse Moreover the Kim-Jong-il-model has a much loweraccuracy in all respects than our original model As a result it is clearthat the double platform is a worse alternative to our single platformThe recent leadership change in Pyongyang has shown little effects asfar as its military provocations are concerned

Fourth it is important to investigate whether a leadership change inSouth Korea has any impact on North Korean military provocationsHas the oscillation of power between the lsquoprogressiversquo administrations(Kim Dae-jung and Roh Moo-hyun) and the lsquoconservativersquo regimes(Lee Myung-bak and Park Geun-hye) invoked different responses fromPyongyang If so we should expect a double platform (one model forthe conservative period vs the other model for the progressive era inSouth Korea) to outperform the original single platform As Table 2shows however the two separate models in the double-platform alter-native perform much worse than the original single platform in all cat-egories an overall accuracy threat accuracy and non-threat accuracyClearly the original model is a better choice It is shown above thatthe leadership change in Pyongyang has not produced significant policyshifts in its military provocations Likewise leadership changes inSouth Korea do not seem to have invoked major policy shifts inPyongyang as far as its military provocations are concerned

Finally it is worth examining an alternative model that excludes thesinking of the Cheonan naval ship case Unlike the remaining four at-tacks in our dataset the North Korean government has consistentlydenied its involvement in the Cheonan incident If Pyongyang had notreally sunk the Cheonan as it claimed it means that our original modelis performing below its full potential because it is built upon some er-roneous cases (ie the Cheonan incident) In that case we should ex-pect the no-Cheonan alternative to outperform our original model AsTable 2 shows however when we exclude all the data related to theCheonan case the resulting model loses much explanatory power com-pared to the original model Only in one category (a threat accuracy)the no-Cheonan-case alternative outperforms our original model buteven in that category the difference is ignorable (503 vs 514) In

22 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

all other categories the original model shows much stronger perfor-mances 823 vs 669 in an overall accuracy and 961 vs 787 ina non-threat accuracy Although Pyongyang has denied its involvementin the Cheonan incident the huge performance gap between the origi-nal model and the no-Cheonan alternative suggests otherwise

A different way of comparing performances of alternative models isshown in Figure 6 where the Receiver Operating Characteristic (ROC)is presented A ROC space is defined by a False Positive Rate (1 ndashSpecificity) in the x-axis and a True Positive Rate (Sensitivity) in the y-axis In a ROC space a theoretically perfect model (ie a combinationof 100 True Positive Rate with 0 False Positive Rate) appears in theupper left corner with the coordinates (0 1) In comparison a purelyrandom model such as a coin toss is found along a 45-degree diagonalline As a result good models (ie those performing better than ran-dom guesses) are found above the 45 degree line (especially close to they-axis) whereas bad models are located below it In Figure 6 the

Figure 6 Receiver operating characteristic (ROC) plot

Detecting patterns in North Korean military provocations 23

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

coordinates of our model (004 05) means that it has a False PositiveRate of 4 while its True Positive Rate is 50 By contrast the coordi-nates of alternative models in the ROC space are as follows (i) theRare Event I model with 12 ratio is (018 033) (ii) the Rare Event IImodel with 13 ratio is (001 018) (iii) the Out of Sampling (KJI KJU) model is (018 015) (iv) the Leadership Change (KJI-only)model is (021 046) (v) the Leadership Change (KJU-only) model is(042 061) (vi) the model for Conservative South Korean leadershipis (018 036) (vii) the model for Conservative South Korean leader-ship is (014 049) and (viii) the model that excludes the Cheonan caseis (021 051) As one can see all eight alternatives yield their coordi-nates either close to or lower than the diagonal line in the ROC spacewhereas our model lies above the diagonal line and close to the y-axisin the ROC space As a result we can conclude that the original modelperforms better than its potential rivals

4 Conclusion

In this article we studied articles published by the KCNA the officialnews outlet of North Korea in order to analyze patterns of its conven-tional military provocations To this end we have adopted a newmethod of automated text classification through supervised machinelearning Our model investigated the frequency of all terms appearingin KCNA articles immediately prior to five North Korean military at-tacks between 1997 and 2013 The frequency of these terms was thencompared with the frequency of terms appearing in KCNA articlespublished during peacetime without military provocations The com-parison brought to light a number of key terms ndash lsquoattack wordsrsquo so tospeak ndash whose appearances spiked in the KCNA prior to NorthKorean attacks Based on these terms our model correctly identifieseight of 10 articles as signs of imminent attacks or as peacetime newspieces

Specifically our model found five pattern-detecting terms of NorthKorean military threats lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquo lsquoJunersquo andlsquoJapanesersquo For a proper analysis of their meaning we went back to thearticles in which these five attack words appeared in order to investi-gate the contexts in which they were used by the North Korean gov-ernment Our investigation shows that in the lead-up to an attack

24 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Pyongyang displayed a strong tendency to emphasize the legacy of itsmilitary struggle against lsquoJapanesersquo colonialism and its fight against theUS imperialism during the Korean War (which began in lsquoJunersquo 1950)perhaps in an effort to heighten a domestic patriotic fervor at a timeof impending crisis In addition immediately before Pyongyang em-barks on hostile provocations the KCNA tends to increase its reprint-ing of lsquosignedrsquo commentaries from Rodong Sinmun which typically crit-icizes the United States South Korea and Japan in harsh termsprobably reflecting its deteriorating relations with the outside world

It seems that our methodology is promising to studies of interna-tional conflicts of various sorts First as shown in this article amachine-learning technology is useful in terms of detecting patternsthat distinguish threats of Pyongyang from its lsquonoisesrsquo or lsquobluffingrsquoSecond for other authoritarian regimes where there is a tight govern-ment controls over the mass media our approach can be used to detectpatterns symptoms or clues to escalating crisis Finally even for dem-ocratic countries with a free media our approach can be used on vari-ous occasions For instance a machine-learning technology can be uti-lized to analyze certain aspects of domestic politics such as signalingor communication between policymakers and the public To cite a con-crete example we can use a machine-learning technique to examinehow an aggressive foreign policy like lsquosabre rattlingrsquo affects public per-ceptions of fear or crisis in democracy Also a machine-learning ap-proach can be used to analyze external interactions of a democratic re-gime with other countries For instance we can test if there are anysignificant correlations between presidential elections in the US and thelevel of North Korean threats or between North Korean nuclear testsand varying public reactions in South Korea

As a final note it is not our contention that the model developed inthis article based on an automated machine-learning technique has de-tected intentional signals which Pyongyang sends to the outside worldimmediately prior to its military attack Instead the key attack wordswe have identified should be seen as signs or patterns that the NorthKorean regime unwittingly displays when it is inching toward a militaryoption For two reasons we doubt an intentional or signaling natureof the attack words in our model First if Pyongyang has indeed usedthe five attack words as a deliberate signal to the outside world that itis gearing up for military provocations why would it send the signal in

Detecting patterns in North Korean military provocations 25

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

such an arcane way that can be detected only through a complicatedautomated text classification process If the North Korean regime in-tends to send a signal to the outside world there is a better clearerand more visible option such as an Official Announcement by theMinistry of Foreign Affairs which Pyongyang has occasionally pub-lished in the pages of the KCNA on important issues Second if it hadbeen sending intentional signals of an impending military strike whywould the North Korean regime deny such an attack afterwards If re-peated an ex post denial would only reduce the credibility of an exante signal creating an image of North Korea as a casual bluffer Fortheir arcane nature and occasional ex post denial the five attack wordsin our model should not be treated as a premeditated signal fromPyongyang Instead they should be understood as inadvertent signs orpatterns unconsciously displayed by the North Korean governmentprior to a military attack In fact it is the unplanned nature of theseattack words that provides even more valuable information to the out-side world because it eliminates the possibility of feigned or false sig-nals from North Korea Like a pitcher who unknowingly flinches be-fore he throws a fast ball Pyongyang may unwittingly display certainpatterns before it launches a military strike

Acknowledgements

The publication of this article is supported by the National ResearchFoundation of Korea Grant funded by the Korean Government (NRF-2014S1A5A2A03065042 and NRF-2013S1A3A2055081) Previous ver-sions of this article were presented at the 2014 International StudiesAssociation Annual Convention (Toronto Canada) For questions re-garding the article please contact H Joo

References

Breslow NE (1196) lsquoStatistics in epidemiology the case-control studyrsquoJournal of American Statistical Association 91 433 14ndash28

Cha V (2010) The Sinking of the Cheonan Center for Strategic ampInternational Studies (2010 April 22) httpcsisorgpublicationsinking-cheonan (24 July 2014 date last accessed)

26 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Choe S-H (2009) Korean Navies Skirmish in DisputedWaters New YorkTimes (2009 November 10) httpwwwnytimescom20091111worldasia11koreahtml_rfrac140 (23 June 2014 date last accessed)

Ertekin S (2013) lsquoAdaptive oversampling for imbalanced data classificationrsquoin G Erol amp L Ricard (ed) Information Sciences and Systems pp261ndash269 New York Springer

Global Security (2002) The naval clash on the yellow sea on 29 June 2002 be-tween South and North Korea the situation and ROKrsquos position (2002July 1) Global Security httpwwwglobalsecurityorgwmdlibrarynewsdprk2002dprk-020701-1htm (22 July 2014 date last accessed)

Hong S (2012) lsquoNorth Korearsquos capability to conduct provocations and ROK-US capability to counter themrsquo Korea Association of Defense IndustryStudies 19 2 135ndash136

Hopkins D and King G (2010) lsquoExtracting systemic social science meaningfrom textrsquo American Journal of Political Science 54 1 229ndash47

Jung S-C (2013) Internal Instability and External Provocation Seoul KINU

Jung S-Y (2008) lsquoA study on the origin of the pueblo incident on 1968rsquoKukjejongchiyonrsquogu 11 2 179ndash207

Jurafsky D and Martin J (2009) Speech and Natural Language ProcessingAn Introduction to Natural Language Processing Computational Linguisticsand Speech Recognition NJ Prentice Hall

Kang C-G (2013) lsquoA laboratory for recursive partyrsquo Hangukcontentshakhoe11 4 23ndash32

King G and Zeng L (2001a) lsquoLogistic regression in rare events datarsquoPolitical Analysis 9 2 137ndash163

King G and Zeng L (2001b) lsquoExplaining rare events in InternationalRelationsrsquo International Organization 55 3 693ndash715

Ko M-K (2015) lsquoNorth Korean military adventurism in the late 1960s andthe changes of party-military relationsrsquo Hyondaibukhanyongu 18 3 7ndash58

Lee W-K (2014) lsquoA case study on the provocations by NKrsquo Kunsaji 91 663ndash110

Litwak R (2007) Regime Change Washington DC Johns HopkinsUniversity Press

Macfie N (2013) The Battles of the Korean West Sea (2010 November 29)Reutershttpwwwreuterscomarticle20101129us-korea-north-clashes-idUSTRE6AS1AL20101129 (5 August 2014 date last accessed)

Mearsheimer J J (2001) The Tragedy of Great Power Politics New YorkWW Norton amp Company

Moore M and Hutchison P (2010) Yeonpyeong Island A History(2010November 23) The Telegraph httpwwwtelegraphcouknewsworldnews

Detecting patterns in North Korean military provocations 27

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

asiasouthkorea8155486Yeonpyeong-Island-A-historyhtml (7 August 2014date last accessed)

Oh I-W (2011) lsquoSouth Korearsquos countermeasures against North KorearsquosArmed Provocation and Offensive dialogue proposalrsquo Trsquoongiljyollyak 111 227ndash266

Rich T (2012a) lsquoLike father like son Correlates of leaderhip in NorthKorearsquos english language newsrsquo Korea Observer 43 4 649ndash674

Rich T (2012b) lsquoDeciphering North Korearsquos nuclear rhetoric an automatedcontent analysis of KCNA newsrsquo Asian Affairs An American Review 3973ndash89

Sigal L (1998) Disarming Strangers Nuclear Diplomacy with North KoreaPrinceton Princeton University Press

Sohn J-Y (2002) South North Korea clash at sea CNN (2002 June 29)httpeditioncnncom2002WORLDasiapcfeast0629koreawarships (21July 2014 date last accessed)

Sudworth J (2010) How South Korean Ship was Sunk BBC News (2010May 20) httpwwwbbccouknews10130909 (4 August 2014 date lastaccessed)

28 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

  • lcw016-FN1
  • lcw016-FN2
  • lcw016-FN3
  • lcw016-FN4
Page 2: Detecting patterns in North Korean military provocations ...idse.or.kr/file/Whang_1.pdf · that North Korea may have because of increasing power asymmetry since the collapse of the

1 Introduction

Since the conclusion of the Korean War (1950ndash53) there has been sig-nificant violence along the Demilitarized Zone (DMZ) In additionthere have been multiple military clashes at sea and elsewhere alongthe coast of South Korea What is the main motivation of Pyongyangto initiate these attacks Is it possible to analyze North Korean mili-tary provocations systematically In answering these questions a pri-mary obstacle has been the relative dearth of information regarding theintentions of Pyongyang As one put it North Korea has been lsquothelongest running intelligence failurersquo (Litwak 2007) or lsquoNorth Koreacould have been on Mars for [the outside world] knew about it It wasa faraway land of unknowns and unknowables explored mostly byspace probes and in this case spy satellitesrsquo (Sigal 1998)

Existing theories of IR fail to provide an adequate guide to under-stand North Korea military provocations For instance considerrealism According to Mearsheimer (2001) three factors affect powercalculations of a country possession of nuclear forces separation bylarge bodies of water and a power distribution Among them the firsttwo cannot explain variations in North Korean military provocationsbecause Pyongyang has the upper hand in nuclear capability againstSeoul and a territorial proximity between the two Koreas is a constantBy contrast the distribution of power can influence the extent of fearthat North Korea may have because of increasing power asymmetrysince the collapse of the Soviet bloc What is unclear however is thelevel of resolve North Korean has to initiate military attacks Withoutdata to estimate how willing Pyongyang is to use force realism is lim-ited in explaining how profound North Korean fears are and hencewhy Pyongyang resorts to violent military attacks occasionally

To cope with the paucity of reliable information two lines of re-search have emerged in North Korean studies a surveyinterview ofNorth Korean refugees and an analysis of North Korean newspapersAbout the latter the Korea Central News Agency (KCNA) has pub-lished government-approved articles since 1996 In recent years severalscholars have focused on the KCNA in hopes of distilling useful insightfrom it (McEachern 2010 Rich 2012ab Joo 2014) Based on count-ing the frequency with which particular terms or phrases appear inKCNA news articles these works have yielded some interesting findings

2 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

about the linguistic features of KCNA articles and their correlationwith nuclear policies political rhetoric economic trends and socialchanges of North Korea

By contrast this article embarks on a new approach a text-classification approach based on a supervised machine-learning tech-nique In particular we develop a model that can distinguish the periodof imminent North Korean provocations from peace time by using theKCNA as our data and supervised machine-learning as our methodFor our cases we select all five North Korean attacks between 1997and 2013 (i) the First Battle of Yonpyong on 15 June 1999 (ii) theSecond Battle of Yonpyong on 29 June 2002 (iii) the Battle ofDaechong on 10 November 2009 (iv) the sinking of the Cheonan navalship on 26 March 2010 (v) the shelling at Yonpyong Island on 23November 2010 Each of these incidents is a conventional NorthKorean attack resulting in more than one casualty on one or bothsides

Our analysis shows that immediately prior to North Korean at-tacks Pyongyang tends to increase its use of five key terms in theKCNA lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquo lsquoJunersquo and lsquoJapanesersquo Our in-vestigation into the contexts in which they appeared in KCNA arti-cles shows that Pyongyang often employs terms like lsquoJunersquo lsquoyearsrsquoand lsquoJapanesersquo to nostalgically invoke past battles of Kim Il-sungagainst Japanese colonialism Moreover the term lsquosignedrsquo indicatesthat the KCNA quotes lsquoofficial commentariesrsquo published in RodongSinmun the mouthpiece of the ruling Workersrsquo Party of Korea(WPK) right before an attack Finally a social network analysis ofkey terms shows that the word lsquoassemblyrsquo refers to the SupremePeoplersquos Assembly (SPA) The high correlation of the term lsquoassem-blyrsquo with military attacks allows us to conjecture that provocationsare often premeditated insofar as they are preceded by increasedSPA activities These findings provide us with a basis for further re-search into North Korean military provocations By differentiatingthreat articles from non-threat items our model can serve as a use-ful indicator for imminent North Korean aggressions

Detecting patterns in North Korean military provocations 3

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

11 Logistics data and cases

111 Data KCNA

As the sole news agency of North Korea the KCNA provides daily re-ports of North Korean newspapers (eg Rodong Sinmun of the rulingWPK and Minju Chosun of the North Korean government) televisionbroadcasts (eg Korean Central Television) and radio broadcasts (egthe Korean Central Broadcasting System) Since 1996 the KCNA haspublished daily news articles via its server in Japan (wwwkcnacojp)On a typical day the KCNA publishes 20ndash40 articles including reportson activities of the ruling North Korean elite official statements of theNorth Korean government (eg an official statement from the Ministryof Foreign Affairs on nuclear issues) several articles selected fromRodong Sinmun and Minju Chosun miscellaneous news about NorthKorean society and reports of recent developments in foreigncountries

Scholars working on North Korea have relied on its two mediasRodong Sinmun and the KCNA As the newspaper of the WPKRodong Sinmun is regarded as the official mouthpiece of the NorthKorean regime As a result it has become popular for scholars espe-cially in South Korea to conduct content analyses of Rodong Sinmunin order to identify trends or policy shifts of the North Korean govern-ment (Koh 2013) From our viewpoint however Rodong Sinmun hastwo weaknesses First it is inappropriate for our project because its pri-mary target is the domestic audience of North Korea (thus publishedonly in Korean) Given that Rodong Sinmun is written for a domesticaudience it is not a proper place to look for signs patterns or indica-tors of Pyongyang that its relations with the outside world (especiallythe US and South Korea) are at a breaking point and that a militaryconflict of some sort is about to occur Instead the KCNA with its fo-cus on foreign audiences (thus published in four different languages ndashEnglish Spanish Russian and Korean) provides a better source to de-tect such signs or patterns Second the KCNA provides a better data-set from a technical viewpoint Although North Korea has madeRodong Sinmun available on internet (wwwkcnacojptoday-rodongrodonghtm) in recent years only few selected articles after 2002 are

4 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

available while only titles are provided for the rest of Rodong Sinmunarticles By contrast all the articles in the KCNA after 1996 are avail-able on the internet thus providing better data for our machine-basedtext-classification analysis to detect any signs patterns or indicatorsfrom Pyongyang that a military strike is likely to occur As a resultthe KCNA is used as the main source of data

Although our case selection of conventional North Korean provoca-tions begins in 1997 because KCNA data is available after that year itis more than a technical convenience to use the year 1997 as a startingpoint It also overlaps with the succession process from Kim Il-sung toKim Jong-il When Kim Il-sung passed away on 8 July 1994 KimJong-il took the traditional three-year mourning period (1994ndash96) be-fore he assumed the official title of General Secretary of WPK in 1997to rule the country As a result the year 1997 serves as a starting pointin our project not only for a technical reason (ie the availability of theKCNA dataset after 1996) but also for a substantive reason (ie it in-cluded North Korean military provocations in the post-Kim Il-sungera) In particular Pyongyang has launched five conventional militarystrikes in the post-Kim Il-sung period

12 Cases five conventional military crises since 1997

Figure 1 shows five North Korean conventional military attacks be-tween 1996 and 2013 (i) the First Battle of Yonpyong on 15 June1999 (ii) the Second Battle of Yonpyong on 29 June 2002 (iii) theBattle of Daechong on 10 November 2009 (iv) the sinking of theCheonan naval ship on 26 March 2010 and(v) the shelling atYonpyong Island on 23 November 2010 For our case selection threecriteria are used First each of these attacks caused one or more casu-alties on at least one side Second all five attacks used conventionalnon-nuclear weapons1 Third all five attacks were initiated by North

1 In this article we have excluded all events associated with the North Korean nuclear crisisIn a separate project however we employ a machine-learning technique to develop to de-tect significant signs or patterns of Pyongyang that it is about to conduct a nuclear testPreliminary research shows an interesting contrast between conventional provocations andnuclear crisis in the North Korean case As will be shown a single platform (covering theentire period from Kim Jong-Il to the Kim Jong-Un) outperforms a double platform (onemodel for the KJI period and another model for the KJU era) in the case of North Koreanconventional provocations Simply put there has not been a major policy shift in

Detecting patterns in North Korean military provocations 5

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Korea Below is a brief description of the five North Korean militaryprovocations

121 The first battle of Yonpyong15June1999

In 1999 Pyongyang claimed that South Korea trespassed the NorthernLimit Line (NLL) in the Yellow Sea On 7 June 1999 when NorthKorean patrol ships and fishing boats crossed the NLL the SouthKorean navy responded by increasing its patrol After a few days ofcontinual trespassing the South Korean government issued a warningand deployed two patrol corvettes (Hong 2012) When Pyongyangcontinued to ignore warnings the South Korean navy blocked NorthKorean boats ramming them (Macfie 2013) After a few skirmishesthe main battle occurred on 15 June 1999 when four North Korean pa-trol ships trespassed across the NLL soon joined by three NorthKorean torpedo boats and three additional patrol ships With a totalof ten battleships North Koreans launched a 25 mm cannon shell

Figure 1 Five conventional North Korean military attacks

Pyongyang as far as its conventional provocations are concerned By contrast a doubleplatform outperforms a single platform in the case of the North Korean nuclear crisis indi-cating a major policy shift from Kim Jong-Il to Kim Jong-Un At the moment we are try-ing to find out whether such a shift indicates an increasing intention of the new NorthKorean leadership under Kim Jong-Un to proclaim a lsquonuclear powerrsquo status by developingits nuclear program further instead of negotiating over it as his father had done on severaloccasions

6 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

(Park 2009) In response the South Korean navy fired with 40 mmand 76 mm machine guns When the battle was over a North Koreantorpedo boat sank a large patrol ship crashed and four patrol shipssustained damage In the process approximately 30 North Korean sol-diers were killed and 70 were wounded As for South Korea four pa-trol killers and one patrol corvette were damaged with nine soldierswounded (Moore and Hutchison 2010)

122 The second battle of Yonpyong 29 June 2002

By 2002 North Korean ships frequently crossed the NLL only to bechased back by South Korean patrol vessels On 29 June 2002 twoNorth Korean patrol ships crossed the border ignoring warnings fromSouth Korean navy speedboats At 1025 am the two North Koreanpatrol boats attacked nearby South Korean vessels with its 85 mmgun 25 mm auxiliary gun and hand carried rockets In response theSouth Korean patrol ships returned fire (Sohn 2002) A 20-minute bat-tle resulted in the death of four South Korean marines one missingand 18 wounded On the North Korean side approximately 30 sailorswere killed or injured While South Korean vessels were partly dam-aged one of the North Korean vessels was towed away in flames(Global Security 2002)

123 The battle of daechong 10 November 2009

On 10 November 2009 a North Korean patrol vessel trespassed theNLL Soon after two 130-ton South Korean vessels issued warningsbut the North Korean vessel ignored them When the South Koreanships fired warning shots the North Korean vessel began firing leadingto a 2-minute battle near Daechong Island located 18 miles off theNorth Korean coast The North Korean patrol vessel also attacked aSouth Korean high-speed patrol vessel (Kim 2009) In response theSouth Korean vessel countered with approximately 200 shots Whenthe battle was over there were no South Korean casualties but NorthKorea suffered one casualty and three injuries with its naval vessellsquowrapped in flamesrsquo (Choe 2009)

Detecting patterns in North Korean military provocations 7

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

124 Sinking of the cheonan naval ship 26 March 2010

On 26 March 2010 the South Korean naval ship Cheonan sank intothe Yellow Sea At 922 pm an explosion occurred in the 1200-tonwarship that was sailing by Baengnyong Island where the two Koreashad clashed numerous times before (Cha 2010) In the end 46 liveswere lost and it was quickly suspected that the ship lsquohad been hit byan external lsquonon-contactrsquo explosionrsquo with North Korea as the primesuspect (Sudworth 2010) Pyongyang denied its involvement andclaimed that the incident had been contrived by South Korea A six-week-long investigation by international experts proved the involve-ment of North Korea in the attack

125 Shelling of Yonpyong island 23 November 2010

On 23 November 2010 North Korea fired artillery shells at YonpyongIsland near the NLL About 200 shells hit the island and set fire todozens of buildings The barrage killed two South Korean citizens andtwo marines while injuring three civilians and 17 soldiers The attackbegan when South Korea was practicing military drills near the NLLdespite North Korean warnings North Korea fired three separate bar-rages with dozens of artillery shells in each barrage In return SouthKorea responded by firing 80 rounds from K9 155 mm self-propelledHowitzers (Kim and Kim 2011) When the battle was over more than50 civilian homes were in flames

13 Research method supervised machine learning

Although the North Korean nuclear crisis has been the focus of the in-ternational community in recent years the history of North Koreanmilitary provocations dates much further back For instancePyongyang destabilized an already precarious security environment inthe Korean Peninsula by launching a series of provocations such asthe hijacking of a South Korean airline (1958) the Korean DMZConflict known as lsquothe Second Korean Warrsquo with more than 700 casu-alties (1969) the hijacking of the USS Pueblo (1968) the notoriousAxe Murder Incident (1976) the bombing of the Korean Air Line 858(1987) and so on

8 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Not surprisingly many scholars have attempted to analyze NorthKorean military provocations trying to identify their lsquocausesrsquo (Jung2013 Lee 2014 Ko 2015) main lsquogoalsrsquo in those attacks (Jung 2008Kang 2013) proper lsquopoliciesrsquo to curtail further threats (Oh 2011Kim 2012) and so on Despite sincere efforts earlier studies sufferedfrom one notorious problem a lack of reliable data As one put itNorth Korea has been lsquothe blackest of black holesrsquo (Litwak 2007289) Under such circumstances scholars had little choice but to makeeducated guesses ndash lsquoguesstimationsrsquo ndash regarding unknown intentionsgoals or likely moves of the North Korean regime As a result the ex-isting literature on North Korean military provocations has beendriven less by reliable data but more by subjective interpretations

By contrast we rely on the official North Korean media KCNA Asfor the method we use a supervised machine-learning technology tomaximize the use of the KCNA dataset that covers various aspects ofNorth Korea from 1997 to 2013 The main advantage of our methodcomes from the quality of the data that is our findings are data-driven and thus more objective than previous works relying on subjec-tive interpretation of selected observations Given the paucity ofreliable information on North Korea it is important to analyze thecontent of official KCNA articles that are publicly available A super-vised machine-learning technology is the optimal method to processsuch data objectively

Supervised machine-learning consists of three steps(i) data collec-tion and document labelling (ii) pre-processing and (iii) model extrac-tion and analysis2 First we obtained our dataset from the KCNAwebsite (httpwwwkcnacojp) Since there are five North Korean mili-tary attacks in our case we pulled 10 tranches of articles from all theKCNA articles published from 1997 to 2013 We then labelled five ofthese tranches lsquothreatrsquo while labelling the other five as lsquonon-threatrsquo All

2 We tried a variety of machine learning algorithms such as Random Forests SupportVector Machines and Conditional Inference Trees Also we cross-validated their resultsOur findings demonstrate that these algorithms produced similar results Moreover all ofthese algorithms identified the similar pattern-detecting terms for classifying the KCNA ar-ticles In this article we reported the results based on the Conditional Inference Tree algo-rithm which ran the tree-structured regression models through binary recursivepartitioning in a conditional inference framework For more details on the ConditionalInference Tree algorithm please see the R Package lsquoPartyrsquo for Conditional Inference Trees(lsquoctreersquo) at httpcranr-projectorgwebpackagespartypartypdf

Detecting patterns in North Korean military provocations 9

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

the articles published within a week preceding each attack is labelled aslsquothreatrsquo and the rest of KCNA articles were treated as lsquonon-threatrsquo Intotal there were 1624 KCNA articles published during this periodwith 487 labelled as lsquothreatrsquo and 1137 labelled as lsquonon-threatrsquoFor the threat articles we extracted all the articles published within aweek of each North Korean military provocation (Fig 2) For the non-threat articles we selected tranches of news articles for a randomlychosen 10-day period from at least two months before or after a NorthKorean attack In so doing our goal was to capture articles that wereunrelated to a North Korean attack thus labelled as lsquonon-threatrsquo

Second the selected dataset is then lsquopreprocessedrsquo Since the originalarticles are not ready for an automated text analysis a cleaning processcalled lsquopreprocessingrsquo is necessary to prepare them for machine-learning Pre-processing is the standard procedure before the applica-tion of an automated text analysis Following the standard preprocess-ing procedure we remove all numbers and stopwords (eg a the into etc)3 We then transformed the body of the text into a documentterm matrix that was composed of rows of news articles followed bycolumns of terms The document term matrix turned preprocessed texts

Non-Threat

-2 +2

Acl

All Threacollected eading up

All Nowere ptwo mo

Threat

Provocation

at-labeledfrom the p to a pro

on-Threatpulled twoonths afte

d articles wseven day

ovocation

t-labeled o months er a provo

ere ays

articles before or

ocation

Non-

r

Threat

Time

Figure 2 Labelling threat and non-threat articles

3 In this article we use one of the most popular preprocessing methods known as the lsquobag ofwordsrsquo concept (Jurafsky and Martin 2009) It discards the use of word order as a factorand removes the so-called lsquostopwordsrsquo from the data The stopwords include punctuationcapitalization common words (eg at an the etc) and numbers For a full list of thestopwords used in our research please visit httpjmlrcsailmitedupapersvolume5lewis04aa11-smart-stop-listenglishstop Although preprocessing reduces the amount of infor-mation it has been shown by researchers that a simplification of text via preprocessing issufficient to infer valuable models (Hopkins and King 2010)

10 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

into quantifiable values (eg term counts and frequencies) for each cor-responding article

Finally once we had collected and preprocessed the KCNA data wesplit the whole data into two subsets a training dataset and a testdataset From the complete set of articles from both threat and non-threat tranches (1624 articles in total) we randomly selected 70(1137 articles) for the training dataset (Fig 3) The labels lsquothreatrsquo orlsquonon-threatrsquo for each article were included in the training dataset forthe purpose of automated machine-learning that is to develop amodel that can select pattern-detecting features based on a priori clas-sifications Basically the key pattern was frequency of occurrence Thefrequency with which certain words and short phrases appeared inthreat or non-threat articles determined how they were selected andweighted as pattern-detecting features in the model The trainedmodel was then applied to the remaining test dataset which was usedsolely for testing the accuracy of our model Importantly we removedall the threat and non-threat labels from each article in the testdataset The purpose for doing so was to challenge our model to usewhat it had learned (from the training dataset) about significant fea-tures (ie a term frequency rate) to accurately classify KCNA articles(in the test dataset) as either a threat or a non-threat without relyingon labels4

All Data 1624 Articles

1137 Articles 487

Training DataThreat Labels Provided

Test DataThreat Labels Removed

Figure 3 Training and test data sets

4 For replication our data is available at httpsdataverseharvardedudatasetxhtmlpersistentIdfrac14doi3A1079102FDVN2FB8CWWD

Detecting patterns in North Korean military provocations 11

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

2 Results

21 Training data results

Figure 4 shows the model we obtained from the training dataset usinga supervised machine-learning The key terms distinguishing NorthKorean threats from a peaceful period were lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquolsquoJunersquo and lsquoJapanesersquo When certain conditions were satisfied (as ex-plained below) the appearance of these terms in KCNA articles couldplay the role of a pattern-indicator that a North Korean military threatwas on the horizon

According to our model the first and strongest indicator of a NorthKorean military provocation is the appearance of the term lsquoyearsrsquo Inthe training dataset all KCNA articles in which the word lsquoyearsrsquo ap-peared at least once (53 articles in total) were published within a weekof a North Korean military strike From our viewpoint the most reli-able sign of an imminent North Korean military provocation is a sud-den increase in the frequency of the term lsquoyearsrsquo in KCNA articles

When the term lsquoyearsrsquo is absent the second best indicator of a NorthKorean military attack is the word lsquosignedrsquo Approximately 85 ofKCNA articles (51 articles in total) that had the term lsquosignedrsquo withoutthe word lsquobakrsquo (from former South Korean president Lee Myung-bak)

Figure 4 First conditional inference tree

12 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

were published within a week of a North Korea military provocationAs a result a sudden spike in the use of the word lsquosignedrsquo (withoutlsquobakrsquo) indicates that a North Korean military provocation is around thecorner In this respect however there is a further condition that mustbe satisfied When lsquosignedrsquo and lsquobakrsquo appear together in the same arti-cle they indicate a pattern of a peaceful situation instead As a resultthe pattern-detecting role of the term lsquosignedrsquo appears to be contingentupon the presence or absence of another term lsquobakrsquo

According to our model the third index indicating that Pyongyangmay be preparing for a military operation is the term lsquoassemblyrsquo Infact all the KCNA articles that included the word lsquoassemblyrsquo withoutlsquoyearsrsquo or lsquosignedrsquo (22 articles in the training dataset) were publishedwithin one week of a North Korean military attack As a result a sud-den increase in the frequency of the word lsquoassemblyrsquo may be a goodpattern indicator that a North Korean military strike is on the horizoneven in the absence of stronger threat terms such as lsquoyearsrsquo andlsquosignedrsquo

The fourth indicator of a North Korean military threat is the termlsquoJunersquo Unlike other indicators however the role of lsquoJunersquo is condi-tional In the absence of other stronger signs (ie years signed and as-sembly) the term lsquoJunersquo can play the role of a significant indicator of aNorth Korean military attack only if another term lsquoreunificationrsquo ap-pears once or less in the same article To our surprise however theterm lsquoJunersquo turns into a strong indicator of a non-threat situation if theword lsquoreunificationrsquo appears more than twice in the same article As aresult it seems that the word lsquoJunersquo implies an increasing NorthKorean threat only if it is not closely associated with another key pat-tern indicator lsquoreunificationrsquo

The last index of a North Korean military provocation is the wordlsquoJapanesersquo When other (and stronger) terms such as lsquoyearsrsquo lsquosignedrsquolsquoassemblyrsquo and lsquoJunersquo were not present all the KCNA articles that in-cluded the term lsquoJapanesersquo (17 articles in the training dataset) appearedwithin one week of a North Korean military provocation As a resultthe use of the term lsquoJapanesersquo serves as a good pattern indication thatin the absence of other attack words Pyongyang is moving dangerouslyclose to a military strike

Detecting patterns in North Korean military provocations 13

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

22 Testing the model

There are 904 KCNA articles that do not include any of the indicatorsdiscussed above Because they do not contain any pattern indicator ofa North Korean attack our model identifies them as non-threats Inreality however about 20 of these articles were published within oneweek of the five military provocations by Pyongyang As a result ourmodel has roughly 80 accuracy in identifying North Korean militarythreats What is encouraging is that our model has identified five keypattern indicators of increasing North Korean threats When these keyterms are put together under certain conditions they correctly identify80 of articles in the training dataset as threat articles or non-threatitems Put differently the model can accurately classify in 8 of 10 caseswhether various messages from Pyongyang are real threats or justrhetoric

Although 80 accuracy is impressive it is too early to be optimisticAfter all 80 accuracy was achieved with the training dataset fromwhich our model was originally derived If we were impressed by its80 accuracy rate we would resemble a case study specialist who de-veloped an elaborate theory from a few cases and then applied it backto those original cases only to be impressed by how accurate hishertheory was The real test consists of applying our model to cases otherthan those from which it is derived This is the reason why we dividedthe entire KCNA data into two subsets the training dataset fromwhich our model is developed and the test dataset to which it will beapplied as a real test

Table 1 shows the result of our model against the test datasetNumbers in the cells indicate whether there was agreement or discrep-ancy between model predictions and actual cases For example our

Table 1 Model accuracy against test dataset

ActualModel Threat Non-threat

Threat 74 13 Positive predictive value 085

Non-threat 73 327 Negative predictive value 082

Sensitivity Specificity Overall accuracy

050 096 082

14 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

model classified 327 cases (the lower right cell) as non-threat and thosearticles were in fact published during non-threat weeks In additionthe model classified 74 articles (the upper left cell) as threats and thosearticles were actually published during threat weeks When we applyour model to the test dataset its overall accuracy turns out 823roughly the same level we obtained with the training dataset Althoughour model is deduced from the training dataset it does not lose cate-gorizing capacity when applied to the test dataset Specifically of 487articles in the test dataset our model correctly classified 401 articles(823) as either threats or non-threats The model is very effective inidentifying harmless rhetoric from Pyongyang that is messages not re-sulting in actual military attacks When applied to a total of 340 non-threat articles in the test dataset our model successfully classified 327items as noise with only 13 misses (ie 967 accuracy) As a result itis very effective at filtering noise from Pyongyang By contrast thepattern-detecting accuracy of our model drops somewhat with respectto threats Of 147 threat articles in the test dataset it correctly identi-fies 74 items as threats while misrecognizing 73 threats as peaceful arti-cles (ie 50 accuracy)

3 Discussion

What does the model tell us in plain terms In particular what are themeanings of the key indicators for a North Korean military threat (ieyears signed assembly June and Japanese) To answer it is necessaryto go back to the original KCNA documents in order to understandthe contexts in which these terms were used A close reading of theKCNA articles that included the five key indicators or lsquoattack wordsrsquoreveals several interesting patterns

First the North Korean regime tends to emphasize its history ofmilitary struggle against foreign enemies before it launches armed prov-ocations It is then understandable that key terms such as lsquoyearsrsquo andlsquoJapanesersquo often appear in KCNA articles immediately before a militarystrike Prior to the second naval clash of Yonpyong on 29 June 2002for instance the North Korean government published a series of arti-cles in which it boasted of the lsquoyearsrsquo of North Korearsquos victoriousstruggles against foreign enemies including lsquoJapanesersquo colonialism(1910ndash45) It is possible that Pyongyang invoked the military legacy of

Detecting patterns in North Korean military provocations 15

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

its supreme leader Kim Il-sung in anticipation of an impending con-flict either to boost the morale of its people or to show the outsideworld that it was determined not to back down

Second the most perplexing term in Figure 4 is lsquoJunersquo because itcan be indicative of both threat and peace Specifically the word lsquoJunersquois an indicator of peace when associated with lsquoreunificationrsquo but it canalso be a sign of a military threat when it does not appear in conjunc-tion with lsquoreunificationrsquo A careful reading of the KCNA articles thatinclude the term lsquoJunersquo suggests an explanation for this strange phe-nomenon It turns out that the word lsquoJunersquo can refer to either theKorean War (which broke out on 25 June 1950) or the historic summitbetween Kim Dae-jung and Kim Il-sung on 15 June 2000 (which waspraised as a major cornerstone for future lsquoreunificationrsquo in the officialNorth Korean press) As a result when the word lsquoJunersquo appears inclose association with lsquoreunificationrsquo it refers to the 2000 summit (orlsquothe June 15 Summitrsquo as it is called in North Korea) thus indicatingthe peaceful intentions of Pyongyang By contrast when the wordlsquoJunersquo appears alone without connection to lsquoreunificationrsquo it refers tothe Korean War thus conveying a more hostile mood

Third it is also interesting to note that the North Korean govern-ment tends to publish many lsquosignedrsquo commentaries from RodongSinmun in the KCNA immediately before it launches military provoca-tions In these lsquosignedrsquo commentaries of Rodong Sinmun Pyongyangusually criticizes foreign countries (especially the United States Japanand South Korea) in very harsh terms on various issues Accordinglythe third term lsquosignedrsquo seems to correspond to the fact that RodongSinmun the official mouthpiece of the North Korean regime barksvery loudly before it actually bites its intended target A sudden in-crease in the number of lsquosignedrsquo commentaries of Rodong Sinmun inKCNA articles appears to be a reliable sign that the North Koreangovernment may be turning to an attack mode

Finally the last indicator of an imminent threat lsquoassemblyrsquo is theeasiest term to identify but the hardest one to interpret As shown inFigure 5 the term lsquoassemblyrsquo is primarily used in reference to the SPABecause the SPA is the highest North Korean authority with regard toits relations with foreign countries it is tempting to interpret a suddenincrease in references to the SPA prior to military provocations as indi-cating that Pyongyang is attempting to openly signal its hostile

16 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

intentions to the outside world In other words belligerent messagesemanating from the SPA (and reported in KCNA articles) may revealthe determination of the North Korean regime

Although plausible the problem with such an interpretation is thata close reading of several KCNA articles using the term lsquoassemblyrsquoshows that messages from the SPA are often not dire at all For exam-ple consider the following article published on 17 November 2010 justdays before the North Korean shelling at Yonpyong Island lsquoKim YongNam President of the Presidium of the DPRK SPA sent a message ofgreetings to Qaboos Bin Said Sultan of Oman on Wednesday on theoccasion of its national holidayrsquo (KCNA 17 November 2010) As thisexample illustrates a typical KCNA article with the word lsquoassemblyrsquodescribes rather routine business of the SPA such as how it sent a mes-sage of congratulations to a foreign country how it greeted visitingdignitaries from foreign countries and so on Clearly such mundanemessages cannot be a meaningful sign of an imminent North Koreanmilitary provocation While the publication of these articles (containingthe term lsquoassemblyrsquo) might possibly correspond to an uptick inPyongyangrsquos diplomatic efforts to strengthen its ties with foreign coun-tries and increase international support prior to a military attack weneed further analyses to substantiate such a hypothesis At the sametime it also seems clear from the test that the term lsquoassemblyrsquo does not

Figure 5 Social network analysis for key words in KCNA articles

Detecting patterns in North Korean military provocations 17

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

appear randomly and is somehow linked to the timing of NorthKorean military provocations Further research is necessary to make aproper interpretation of this seemingly irrelevant yet apparently signifi-cant indicator of North Korean military threats

31 Robustness checks

Before we conclude it is necessary to address five issues regarding therobustness of our findings (i) the rare-event issue (that is due to therare nature of a North Korean military attack our sampling ratio of710 should be justified) (ii) an out-of-sample alternative (that is in-stead of building our model by using 70 of data from the five NorthKorean military strikes and then applying it to the remaining 30 ofthe data it may be better to develop a model from the first threestrikes and then apply it to the remaining two North Korean provoca-tions) and (iii) a North Korean leadership change effect (that is in-stead of elaborating a single model covering 1997 to 2013 it may makesense to develop two separate models [one for the Kim Jong-il periodand the other for the Kim Jong-un era] in order to check whether thereis a leadership change effect) (iv) a South Korean leadership changeeffect (that is it may be better to elaborate two different models [onefor the lsquoconservativersquo period during Lee Myung-bak and Park Geun-hye and the other for the lsquoprogressiversquo era during Kim Dae-jung andRoh Moo-hyun] in order to investigate whether North Korea re-sponded differently to leadership changes in South Korea and (v) ano-Cheonan alternative (that is it is worth elaborating a model whileexcluding the Cheonan naval ship case because unlike the remainingfour attacks Pyongyang has denied its involvement in sinking theCheonan In this section these five issues are addressed to check therobustness of our model

First there is a discrepancy in the ratio of threat days vs non-threatdays between the data used in our analysis and the actual frequencyAs explained earlier we have defined the threat articles as those pub-lished one week prior to each actual crisis whereas the non-threatitems are defined as those chosen randomly for a 10-day period at leasttwo months before or after actual provocations For all five crises theratio of our data is then 3550 (in days) or 710 In reality however thenumber of peace days (ie 18 years minus 35 days) is much larger than

18 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

the number of crisis days (35 days) When we count all of the peacedays that are not included in our data (ie when we take all TrueNegatives into our data) the North Korean provocations become ex-tremely rare events As a result a possible criticism is that our ap-proach does not represent the true data generating process in a sam-pling procedure While military provocations occur rarely in realityour sampling procedure exaggerates their likelihood by significantly re-ducing the number of peace days in the data Put more technically ourmodel reduces the number of False Positives while increasing the num-ber of False Negatives

Despite the criticism our sampling strategy can be justified for tworeasons First the rare-event problem is not new in political scienceFor example King and Zeng (2001ab) addressed the same problem instatistical analysis (eg logit analysis with rare events) A commonlysuggested solution is the lsquochoice-based or endogenous stratified sam-plingrsquo in econometrics and the lsquocase-control designrsquo in epidemiology(Breslow 1996) The idea is to lsquoselect within categories of Yrsquo such thatall crisis days are sampled while we use a small random fraction ofnon-crisis days In this respect our sampling strategy is consistent withsuch an approach in that we have chosen all crisis days (35) and a ran-dom fraction of peace days (50) Not surprisingly such an approach iscommonly used in supervised machine-learning as well The rare-eventissue also known as a class imbalance problem is an ongoing issue insupervised machine-learning because the size of one class is often muchlarger than that of the other class (eg premature births violent civilconflicts fraudulent credit card transactions etc) In supervisedmachine-learning two solutions have been suggested a data samplingtechnique and an algorithmic modification technique Whereas datasampling is to under-sample the majority class while over-sampling theminority class a modification technique is to combine both over- andunder-sampling methods at the same time In this article we haveadopted a data sampling technique in order to address the class imbal-ance between threat days (a minority class) and non-threat days (a ma-jority class) in North Korean military provocations

Second although reducing the number of non-crisis days in ourdata (under-sampling) while using all the crisis days (over-sampling) isconsistent with both statistical and machine-learning literature therestill remains a question How far should we reduce the number of

Detecting patterns in North Korean military provocations 19

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

peace days Will a ratio of 12 13 or an even more skewed ratio gen-erate a better performance than our model using a 710 ratio To an-swer this question we pushed the ratio until we saw a clear sign of themodel becoming too skewed towards the majority class (peace days)In our case it occurred when the ratio between the two classes ex-ceeded 13 On this subject literature on machine-learning clearlyshows that the ratio of crisis and non-crisis should not be significantlyunbalanced According to Ertekin (2013) lsquo[i]n problems where theprevalence of classes is imbalanced it is necessary to prevent the resul-tant model from being skewed towards the majority (negative) classand to ensure that the model is capable of reflecting the true nature ofthe minority (positive) classrsquo Otherwise the generated model can sufferfrom an over-fitting problem For instance in the face of an extremelyskewed dataset (eg the real ratio of our data would be 35 crisis daysvs 6535 peace days) an automated machine-learning process naturallybecomes lsquogreedyrsquo that is in order to increase its overall accuracy it in-creasingly focuses on the larger category (6535 days) while virtually ig-noring the smaller category (35 days) Generating the model in thisway would be problematic however because our object is to focus onthe minority category of crisis days which is less than 05 of all daysfrom 1997 to 2013 After all we are trying to classify upcoming NorthKorean attacks not peaceful days

With this goal in mind we ran robustness checks to see if alternativemodels yield a better explanatory power when we increased the numberof peace days vis-a-vis threat days Because the maximum limit of anunbalanced ratio for automated machine-learning is 13 we attemptedboth 12 and 13 in our tests As Table 2 shows it is clear that the orig-inal model outperforms both alternatives Compared to the 12 ratiothe original model has more explanatory power in every way (iehigher overall accuracy higher threat accuracy higher non-threat accu-racy) By contrast the 13 ratio model has a slightly lower overall accu-racy marginally higher non-threat accuracy but devastatingly lowerthreat accuracy (175) than our original model because the increasingimbalance (from 710 to 13) turned the automated machine-learningprocess to a lsquogreedyrsquo mode As a result it is our conclusion that theoriginal model uses the golden ratio with the most accurate categoriz-ing capacity

20 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Second it is also necessary to check whether an out-of-sample ap-proach is a better alternative to our original model In our analysis wedeveloped a model by using 70 of data from the five North Koreanattacks and then applied it to the remaining 30 of data from Onemay suspect however that a better approach is to elaborate a modelfrom the first three North Korean strikes (ie during the Kim Jong-ilperiod) and then apply it to the remaining two attacks (ie during theKim Jong-un era) in order to see whether it yields more explanatorypower As Table 2 shows the opposite turns out to be the case In factthe out-of-sampling model loses explanatory power in every possibleway (ie lower overall accuracy lower threat accuracy and lower non-threat accuracy) As a result our original model outperforms the out-of-sampling alternative

Third it is necessary to check whether or not there is a leadershipchange effect in North Korea Has the change of power from KimJong-il to Kim Jong-un produced such significant differences in theirleadership style that it may be worth developing two separate models(one for the Kim Jong-il period the other for the Kim Jong-un era)instead of a single model covering the entire period as we did As

Table 2 Comparison with alternative models

ModelsOverall accuracy (N)

Threat accuracy(Sensitivity)

Non-threataccuracy (Specificity)

Rare event I 724 328 823

12 Ratio (368508) (66201) (302367)

Rare event II 788 175 99

13 Ratio (656832) (36206) (620626)

Out of sampling 541 145 821

KJI KJU (6471195) (72495) (575700)

NK leadership KJI 685 (98143) 459 (2861) 793 (6582)

KJI vs KJU KJU 595 (195328) 612 (93152) 58 (102176)

SK leadership Con 614 (221360) 363 (58160) 815 (163200)

Con vs Prog Prog 681 (98144) 493 (3571) 863 (6373)

No cheonan 669 514 787

Case (273408) 92178 181230

Our Model 823 503 961

(401487) (74147) (327340)

Detecting patterns in North Korean military provocations 21

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Table 2 shows the double platform alternative yields a worse outcomeoverall In fact its only advantage is that the Kim-Jong-un-model hasslightly higher threat accuracy (612) than our original model(503) In every other aspect however the Kim-Jong-un-model per-forms much worse Moreover the Kim-Jong-il-model has a much loweraccuracy in all respects than our original model As a result it is clearthat the double platform is a worse alternative to our single platformThe recent leadership change in Pyongyang has shown little effects asfar as its military provocations are concerned

Fourth it is important to investigate whether a leadership change inSouth Korea has any impact on North Korean military provocationsHas the oscillation of power between the lsquoprogressiversquo administrations(Kim Dae-jung and Roh Moo-hyun) and the lsquoconservativersquo regimes(Lee Myung-bak and Park Geun-hye) invoked different responses fromPyongyang If so we should expect a double platform (one model forthe conservative period vs the other model for the progressive era inSouth Korea) to outperform the original single platform As Table 2shows however the two separate models in the double-platform alter-native perform much worse than the original single platform in all cat-egories an overall accuracy threat accuracy and non-threat accuracyClearly the original model is a better choice It is shown above thatthe leadership change in Pyongyang has not produced significant policyshifts in its military provocations Likewise leadership changes inSouth Korea do not seem to have invoked major policy shifts inPyongyang as far as its military provocations are concerned

Finally it is worth examining an alternative model that excludes thesinking of the Cheonan naval ship case Unlike the remaining four at-tacks in our dataset the North Korean government has consistentlydenied its involvement in the Cheonan incident If Pyongyang had notreally sunk the Cheonan as it claimed it means that our original modelis performing below its full potential because it is built upon some er-roneous cases (ie the Cheonan incident) In that case we should ex-pect the no-Cheonan alternative to outperform our original model AsTable 2 shows however when we exclude all the data related to theCheonan case the resulting model loses much explanatory power com-pared to the original model Only in one category (a threat accuracy)the no-Cheonan-case alternative outperforms our original model buteven in that category the difference is ignorable (503 vs 514) In

22 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

all other categories the original model shows much stronger perfor-mances 823 vs 669 in an overall accuracy and 961 vs 787 ina non-threat accuracy Although Pyongyang has denied its involvementin the Cheonan incident the huge performance gap between the origi-nal model and the no-Cheonan alternative suggests otherwise

A different way of comparing performances of alternative models isshown in Figure 6 where the Receiver Operating Characteristic (ROC)is presented A ROC space is defined by a False Positive Rate (1 ndashSpecificity) in the x-axis and a True Positive Rate (Sensitivity) in the y-axis In a ROC space a theoretically perfect model (ie a combinationof 100 True Positive Rate with 0 False Positive Rate) appears in theupper left corner with the coordinates (0 1) In comparison a purelyrandom model such as a coin toss is found along a 45-degree diagonalline As a result good models (ie those performing better than ran-dom guesses) are found above the 45 degree line (especially close to they-axis) whereas bad models are located below it In Figure 6 the

Figure 6 Receiver operating characteristic (ROC) plot

Detecting patterns in North Korean military provocations 23

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

coordinates of our model (004 05) means that it has a False PositiveRate of 4 while its True Positive Rate is 50 By contrast the coordi-nates of alternative models in the ROC space are as follows (i) theRare Event I model with 12 ratio is (018 033) (ii) the Rare Event IImodel with 13 ratio is (001 018) (iii) the Out of Sampling (KJI KJU) model is (018 015) (iv) the Leadership Change (KJI-only)model is (021 046) (v) the Leadership Change (KJU-only) model is(042 061) (vi) the model for Conservative South Korean leadershipis (018 036) (vii) the model for Conservative South Korean leader-ship is (014 049) and (viii) the model that excludes the Cheonan caseis (021 051) As one can see all eight alternatives yield their coordi-nates either close to or lower than the diagonal line in the ROC spacewhereas our model lies above the diagonal line and close to the y-axisin the ROC space As a result we can conclude that the original modelperforms better than its potential rivals

4 Conclusion

In this article we studied articles published by the KCNA the officialnews outlet of North Korea in order to analyze patterns of its conven-tional military provocations To this end we have adopted a newmethod of automated text classification through supervised machinelearning Our model investigated the frequency of all terms appearingin KCNA articles immediately prior to five North Korean military at-tacks between 1997 and 2013 The frequency of these terms was thencompared with the frequency of terms appearing in KCNA articlespublished during peacetime without military provocations The com-parison brought to light a number of key terms ndash lsquoattack wordsrsquo so tospeak ndash whose appearances spiked in the KCNA prior to NorthKorean attacks Based on these terms our model correctly identifieseight of 10 articles as signs of imminent attacks or as peacetime newspieces

Specifically our model found five pattern-detecting terms of NorthKorean military threats lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquo lsquoJunersquo andlsquoJapanesersquo For a proper analysis of their meaning we went back to thearticles in which these five attack words appeared in order to investi-gate the contexts in which they were used by the North Korean gov-ernment Our investigation shows that in the lead-up to an attack

24 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Pyongyang displayed a strong tendency to emphasize the legacy of itsmilitary struggle against lsquoJapanesersquo colonialism and its fight against theUS imperialism during the Korean War (which began in lsquoJunersquo 1950)perhaps in an effort to heighten a domestic patriotic fervor at a timeof impending crisis In addition immediately before Pyongyang em-barks on hostile provocations the KCNA tends to increase its reprint-ing of lsquosignedrsquo commentaries from Rodong Sinmun which typically crit-icizes the United States South Korea and Japan in harsh termsprobably reflecting its deteriorating relations with the outside world

It seems that our methodology is promising to studies of interna-tional conflicts of various sorts First as shown in this article amachine-learning technology is useful in terms of detecting patternsthat distinguish threats of Pyongyang from its lsquonoisesrsquo or lsquobluffingrsquoSecond for other authoritarian regimes where there is a tight govern-ment controls over the mass media our approach can be used to detectpatterns symptoms or clues to escalating crisis Finally even for dem-ocratic countries with a free media our approach can be used on vari-ous occasions For instance a machine-learning technology can be uti-lized to analyze certain aspects of domestic politics such as signalingor communication between policymakers and the public To cite a con-crete example we can use a machine-learning technique to examinehow an aggressive foreign policy like lsquosabre rattlingrsquo affects public per-ceptions of fear or crisis in democracy Also a machine-learning ap-proach can be used to analyze external interactions of a democratic re-gime with other countries For instance we can test if there are anysignificant correlations between presidential elections in the US and thelevel of North Korean threats or between North Korean nuclear testsand varying public reactions in South Korea

As a final note it is not our contention that the model developed inthis article based on an automated machine-learning technique has de-tected intentional signals which Pyongyang sends to the outside worldimmediately prior to its military attack Instead the key attack wordswe have identified should be seen as signs or patterns that the NorthKorean regime unwittingly displays when it is inching toward a militaryoption For two reasons we doubt an intentional or signaling natureof the attack words in our model First if Pyongyang has indeed usedthe five attack words as a deliberate signal to the outside world that itis gearing up for military provocations why would it send the signal in

Detecting patterns in North Korean military provocations 25

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

such an arcane way that can be detected only through a complicatedautomated text classification process If the North Korean regime in-tends to send a signal to the outside world there is a better clearerand more visible option such as an Official Announcement by theMinistry of Foreign Affairs which Pyongyang has occasionally pub-lished in the pages of the KCNA on important issues Second if it hadbeen sending intentional signals of an impending military strike whywould the North Korean regime deny such an attack afterwards If re-peated an ex post denial would only reduce the credibility of an exante signal creating an image of North Korea as a casual bluffer Fortheir arcane nature and occasional ex post denial the five attack wordsin our model should not be treated as a premeditated signal fromPyongyang Instead they should be understood as inadvertent signs orpatterns unconsciously displayed by the North Korean governmentprior to a military attack In fact it is the unplanned nature of theseattack words that provides even more valuable information to the out-side world because it eliminates the possibility of feigned or false sig-nals from North Korea Like a pitcher who unknowingly flinches be-fore he throws a fast ball Pyongyang may unwittingly display certainpatterns before it launches a military strike

Acknowledgements

The publication of this article is supported by the National ResearchFoundation of Korea Grant funded by the Korean Government (NRF-2014S1A5A2A03065042 and NRF-2013S1A3A2055081) Previous ver-sions of this article were presented at the 2014 International StudiesAssociation Annual Convention (Toronto Canada) For questions re-garding the article please contact H Joo

References

Breslow NE (1196) lsquoStatistics in epidemiology the case-control studyrsquoJournal of American Statistical Association 91 433 14ndash28

Cha V (2010) The Sinking of the Cheonan Center for Strategic ampInternational Studies (2010 April 22) httpcsisorgpublicationsinking-cheonan (24 July 2014 date last accessed)

26 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Choe S-H (2009) Korean Navies Skirmish in DisputedWaters New YorkTimes (2009 November 10) httpwwwnytimescom20091111worldasia11koreahtml_rfrac140 (23 June 2014 date last accessed)

Ertekin S (2013) lsquoAdaptive oversampling for imbalanced data classificationrsquoin G Erol amp L Ricard (ed) Information Sciences and Systems pp261ndash269 New York Springer

Global Security (2002) The naval clash on the yellow sea on 29 June 2002 be-tween South and North Korea the situation and ROKrsquos position (2002July 1) Global Security httpwwwglobalsecurityorgwmdlibrarynewsdprk2002dprk-020701-1htm (22 July 2014 date last accessed)

Hong S (2012) lsquoNorth Korearsquos capability to conduct provocations and ROK-US capability to counter themrsquo Korea Association of Defense IndustryStudies 19 2 135ndash136

Hopkins D and King G (2010) lsquoExtracting systemic social science meaningfrom textrsquo American Journal of Political Science 54 1 229ndash47

Jung S-C (2013) Internal Instability and External Provocation Seoul KINU

Jung S-Y (2008) lsquoA study on the origin of the pueblo incident on 1968rsquoKukjejongchiyonrsquogu 11 2 179ndash207

Jurafsky D and Martin J (2009) Speech and Natural Language ProcessingAn Introduction to Natural Language Processing Computational Linguisticsand Speech Recognition NJ Prentice Hall

Kang C-G (2013) lsquoA laboratory for recursive partyrsquo Hangukcontentshakhoe11 4 23ndash32

King G and Zeng L (2001a) lsquoLogistic regression in rare events datarsquoPolitical Analysis 9 2 137ndash163

King G and Zeng L (2001b) lsquoExplaining rare events in InternationalRelationsrsquo International Organization 55 3 693ndash715

Ko M-K (2015) lsquoNorth Korean military adventurism in the late 1960s andthe changes of party-military relationsrsquo Hyondaibukhanyongu 18 3 7ndash58

Lee W-K (2014) lsquoA case study on the provocations by NKrsquo Kunsaji 91 663ndash110

Litwak R (2007) Regime Change Washington DC Johns HopkinsUniversity Press

Macfie N (2013) The Battles of the Korean West Sea (2010 November 29)Reutershttpwwwreuterscomarticle20101129us-korea-north-clashes-idUSTRE6AS1AL20101129 (5 August 2014 date last accessed)

Mearsheimer J J (2001) The Tragedy of Great Power Politics New YorkWW Norton amp Company

Moore M and Hutchison P (2010) Yeonpyeong Island A History(2010November 23) The Telegraph httpwwwtelegraphcouknewsworldnews

Detecting patterns in North Korean military provocations 27

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

asiasouthkorea8155486Yeonpyeong-Island-A-historyhtml (7 August 2014date last accessed)

Oh I-W (2011) lsquoSouth Korearsquos countermeasures against North KorearsquosArmed Provocation and Offensive dialogue proposalrsquo Trsquoongiljyollyak 111 227ndash266

Rich T (2012a) lsquoLike father like son Correlates of leaderhip in NorthKorearsquos english language newsrsquo Korea Observer 43 4 649ndash674

Rich T (2012b) lsquoDeciphering North Korearsquos nuclear rhetoric an automatedcontent analysis of KCNA newsrsquo Asian Affairs An American Review 3973ndash89

Sigal L (1998) Disarming Strangers Nuclear Diplomacy with North KoreaPrinceton Princeton University Press

Sohn J-Y (2002) South North Korea clash at sea CNN (2002 June 29)httpeditioncnncom2002WORLDasiapcfeast0629koreawarships (21July 2014 date last accessed)

Sudworth J (2010) How South Korean Ship was Sunk BBC News (2010May 20) httpwwwbbccouknews10130909 (4 August 2014 date lastaccessed)

28 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

  • lcw016-FN1
  • lcw016-FN2
  • lcw016-FN3
  • lcw016-FN4
Page 3: Detecting patterns in North Korean military provocations ...idse.or.kr/file/Whang_1.pdf · that North Korea may have because of increasing power asymmetry since the collapse of the

about the linguistic features of KCNA articles and their correlationwith nuclear policies political rhetoric economic trends and socialchanges of North Korea

By contrast this article embarks on a new approach a text-classification approach based on a supervised machine-learning tech-nique In particular we develop a model that can distinguish the periodof imminent North Korean provocations from peace time by using theKCNA as our data and supervised machine-learning as our methodFor our cases we select all five North Korean attacks between 1997and 2013 (i) the First Battle of Yonpyong on 15 June 1999 (ii) theSecond Battle of Yonpyong on 29 June 2002 (iii) the Battle ofDaechong on 10 November 2009 (iv) the sinking of the Cheonan navalship on 26 March 2010 (v) the shelling at Yonpyong Island on 23November 2010 Each of these incidents is a conventional NorthKorean attack resulting in more than one casualty on one or bothsides

Our analysis shows that immediately prior to North Korean at-tacks Pyongyang tends to increase its use of five key terms in theKCNA lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquo lsquoJunersquo and lsquoJapanesersquo Our in-vestigation into the contexts in which they appeared in KCNA arti-cles shows that Pyongyang often employs terms like lsquoJunersquo lsquoyearsrsquoand lsquoJapanesersquo to nostalgically invoke past battles of Kim Il-sungagainst Japanese colonialism Moreover the term lsquosignedrsquo indicatesthat the KCNA quotes lsquoofficial commentariesrsquo published in RodongSinmun the mouthpiece of the ruling Workersrsquo Party of Korea(WPK) right before an attack Finally a social network analysis ofkey terms shows that the word lsquoassemblyrsquo refers to the SupremePeoplersquos Assembly (SPA) The high correlation of the term lsquoassem-blyrsquo with military attacks allows us to conjecture that provocationsare often premeditated insofar as they are preceded by increasedSPA activities These findings provide us with a basis for further re-search into North Korean military provocations By differentiatingthreat articles from non-threat items our model can serve as a use-ful indicator for imminent North Korean aggressions

Detecting patterns in North Korean military provocations 3

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

11 Logistics data and cases

111 Data KCNA

As the sole news agency of North Korea the KCNA provides daily re-ports of North Korean newspapers (eg Rodong Sinmun of the rulingWPK and Minju Chosun of the North Korean government) televisionbroadcasts (eg Korean Central Television) and radio broadcasts (egthe Korean Central Broadcasting System) Since 1996 the KCNA haspublished daily news articles via its server in Japan (wwwkcnacojp)On a typical day the KCNA publishes 20ndash40 articles including reportson activities of the ruling North Korean elite official statements of theNorth Korean government (eg an official statement from the Ministryof Foreign Affairs on nuclear issues) several articles selected fromRodong Sinmun and Minju Chosun miscellaneous news about NorthKorean society and reports of recent developments in foreigncountries

Scholars working on North Korea have relied on its two mediasRodong Sinmun and the KCNA As the newspaper of the WPKRodong Sinmun is regarded as the official mouthpiece of the NorthKorean regime As a result it has become popular for scholars espe-cially in South Korea to conduct content analyses of Rodong Sinmunin order to identify trends or policy shifts of the North Korean govern-ment (Koh 2013) From our viewpoint however Rodong Sinmun hastwo weaknesses First it is inappropriate for our project because its pri-mary target is the domestic audience of North Korea (thus publishedonly in Korean) Given that Rodong Sinmun is written for a domesticaudience it is not a proper place to look for signs patterns or indica-tors of Pyongyang that its relations with the outside world (especiallythe US and South Korea) are at a breaking point and that a militaryconflict of some sort is about to occur Instead the KCNA with its fo-cus on foreign audiences (thus published in four different languages ndashEnglish Spanish Russian and Korean) provides a better source to de-tect such signs or patterns Second the KCNA provides a better data-set from a technical viewpoint Although North Korea has madeRodong Sinmun available on internet (wwwkcnacojptoday-rodongrodonghtm) in recent years only few selected articles after 2002 are

4 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

available while only titles are provided for the rest of Rodong Sinmunarticles By contrast all the articles in the KCNA after 1996 are avail-able on the internet thus providing better data for our machine-basedtext-classification analysis to detect any signs patterns or indicatorsfrom Pyongyang that a military strike is likely to occur As a resultthe KCNA is used as the main source of data

Although our case selection of conventional North Korean provoca-tions begins in 1997 because KCNA data is available after that year itis more than a technical convenience to use the year 1997 as a startingpoint It also overlaps with the succession process from Kim Il-sung toKim Jong-il When Kim Il-sung passed away on 8 July 1994 KimJong-il took the traditional three-year mourning period (1994ndash96) be-fore he assumed the official title of General Secretary of WPK in 1997to rule the country As a result the year 1997 serves as a starting pointin our project not only for a technical reason (ie the availability of theKCNA dataset after 1996) but also for a substantive reason (ie it in-cluded North Korean military provocations in the post-Kim Il-sungera) In particular Pyongyang has launched five conventional militarystrikes in the post-Kim Il-sung period

12 Cases five conventional military crises since 1997

Figure 1 shows five North Korean conventional military attacks be-tween 1996 and 2013 (i) the First Battle of Yonpyong on 15 June1999 (ii) the Second Battle of Yonpyong on 29 June 2002 (iii) theBattle of Daechong on 10 November 2009 (iv) the sinking of theCheonan naval ship on 26 March 2010 and(v) the shelling atYonpyong Island on 23 November 2010 For our case selection threecriteria are used First each of these attacks caused one or more casu-alties on at least one side Second all five attacks used conventionalnon-nuclear weapons1 Third all five attacks were initiated by North

1 In this article we have excluded all events associated with the North Korean nuclear crisisIn a separate project however we employ a machine-learning technique to develop to de-tect significant signs or patterns of Pyongyang that it is about to conduct a nuclear testPreliminary research shows an interesting contrast between conventional provocations andnuclear crisis in the North Korean case As will be shown a single platform (covering theentire period from Kim Jong-Il to the Kim Jong-Un) outperforms a double platform (onemodel for the KJI period and another model for the KJU era) in the case of North Koreanconventional provocations Simply put there has not been a major policy shift in

Detecting patterns in North Korean military provocations 5

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Korea Below is a brief description of the five North Korean militaryprovocations

121 The first battle of Yonpyong15June1999

In 1999 Pyongyang claimed that South Korea trespassed the NorthernLimit Line (NLL) in the Yellow Sea On 7 June 1999 when NorthKorean patrol ships and fishing boats crossed the NLL the SouthKorean navy responded by increasing its patrol After a few days ofcontinual trespassing the South Korean government issued a warningand deployed two patrol corvettes (Hong 2012) When Pyongyangcontinued to ignore warnings the South Korean navy blocked NorthKorean boats ramming them (Macfie 2013) After a few skirmishesthe main battle occurred on 15 June 1999 when four North Korean pa-trol ships trespassed across the NLL soon joined by three NorthKorean torpedo boats and three additional patrol ships With a totalof ten battleships North Koreans launched a 25 mm cannon shell

Figure 1 Five conventional North Korean military attacks

Pyongyang as far as its conventional provocations are concerned By contrast a doubleplatform outperforms a single platform in the case of the North Korean nuclear crisis indi-cating a major policy shift from Kim Jong-Il to Kim Jong-Un At the moment we are try-ing to find out whether such a shift indicates an increasing intention of the new NorthKorean leadership under Kim Jong-Un to proclaim a lsquonuclear powerrsquo status by developingits nuclear program further instead of negotiating over it as his father had done on severaloccasions

6 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

(Park 2009) In response the South Korean navy fired with 40 mmand 76 mm machine guns When the battle was over a North Koreantorpedo boat sank a large patrol ship crashed and four patrol shipssustained damage In the process approximately 30 North Korean sol-diers were killed and 70 were wounded As for South Korea four pa-trol killers and one patrol corvette were damaged with nine soldierswounded (Moore and Hutchison 2010)

122 The second battle of Yonpyong 29 June 2002

By 2002 North Korean ships frequently crossed the NLL only to bechased back by South Korean patrol vessels On 29 June 2002 twoNorth Korean patrol ships crossed the border ignoring warnings fromSouth Korean navy speedboats At 1025 am the two North Koreanpatrol boats attacked nearby South Korean vessels with its 85 mmgun 25 mm auxiliary gun and hand carried rockets In response theSouth Korean patrol ships returned fire (Sohn 2002) A 20-minute bat-tle resulted in the death of four South Korean marines one missingand 18 wounded On the North Korean side approximately 30 sailorswere killed or injured While South Korean vessels were partly dam-aged one of the North Korean vessels was towed away in flames(Global Security 2002)

123 The battle of daechong 10 November 2009

On 10 November 2009 a North Korean patrol vessel trespassed theNLL Soon after two 130-ton South Korean vessels issued warningsbut the North Korean vessel ignored them When the South Koreanships fired warning shots the North Korean vessel began firing leadingto a 2-minute battle near Daechong Island located 18 miles off theNorth Korean coast The North Korean patrol vessel also attacked aSouth Korean high-speed patrol vessel (Kim 2009) In response theSouth Korean vessel countered with approximately 200 shots Whenthe battle was over there were no South Korean casualties but NorthKorea suffered one casualty and three injuries with its naval vessellsquowrapped in flamesrsquo (Choe 2009)

Detecting patterns in North Korean military provocations 7

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

124 Sinking of the cheonan naval ship 26 March 2010

On 26 March 2010 the South Korean naval ship Cheonan sank intothe Yellow Sea At 922 pm an explosion occurred in the 1200-tonwarship that was sailing by Baengnyong Island where the two Koreashad clashed numerous times before (Cha 2010) In the end 46 liveswere lost and it was quickly suspected that the ship lsquohad been hit byan external lsquonon-contactrsquo explosionrsquo with North Korea as the primesuspect (Sudworth 2010) Pyongyang denied its involvement andclaimed that the incident had been contrived by South Korea A six-week-long investigation by international experts proved the involve-ment of North Korea in the attack

125 Shelling of Yonpyong island 23 November 2010

On 23 November 2010 North Korea fired artillery shells at YonpyongIsland near the NLL About 200 shells hit the island and set fire todozens of buildings The barrage killed two South Korean citizens andtwo marines while injuring three civilians and 17 soldiers The attackbegan when South Korea was practicing military drills near the NLLdespite North Korean warnings North Korea fired three separate bar-rages with dozens of artillery shells in each barrage In return SouthKorea responded by firing 80 rounds from K9 155 mm self-propelledHowitzers (Kim and Kim 2011) When the battle was over more than50 civilian homes were in flames

13 Research method supervised machine learning

Although the North Korean nuclear crisis has been the focus of the in-ternational community in recent years the history of North Koreanmilitary provocations dates much further back For instancePyongyang destabilized an already precarious security environment inthe Korean Peninsula by launching a series of provocations such asthe hijacking of a South Korean airline (1958) the Korean DMZConflict known as lsquothe Second Korean Warrsquo with more than 700 casu-alties (1969) the hijacking of the USS Pueblo (1968) the notoriousAxe Murder Incident (1976) the bombing of the Korean Air Line 858(1987) and so on

8 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Not surprisingly many scholars have attempted to analyze NorthKorean military provocations trying to identify their lsquocausesrsquo (Jung2013 Lee 2014 Ko 2015) main lsquogoalsrsquo in those attacks (Jung 2008Kang 2013) proper lsquopoliciesrsquo to curtail further threats (Oh 2011Kim 2012) and so on Despite sincere efforts earlier studies sufferedfrom one notorious problem a lack of reliable data As one put itNorth Korea has been lsquothe blackest of black holesrsquo (Litwak 2007289) Under such circumstances scholars had little choice but to makeeducated guesses ndash lsquoguesstimationsrsquo ndash regarding unknown intentionsgoals or likely moves of the North Korean regime As a result the ex-isting literature on North Korean military provocations has beendriven less by reliable data but more by subjective interpretations

By contrast we rely on the official North Korean media KCNA Asfor the method we use a supervised machine-learning technology tomaximize the use of the KCNA dataset that covers various aspects ofNorth Korea from 1997 to 2013 The main advantage of our methodcomes from the quality of the data that is our findings are data-driven and thus more objective than previous works relying on subjec-tive interpretation of selected observations Given the paucity ofreliable information on North Korea it is important to analyze thecontent of official KCNA articles that are publicly available A super-vised machine-learning technology is the optimal method to processsuch data objectively

Supervised machine-learning consists of three steps(i) data collec-tion and document labelling (ii) pre-processing and (iii) model extrac-tion and analysis2 First we obtained our dataset from the KCNAwebsite (httpwwwkcnacojp) Since there are five North Korean mili-tary attacks in our case we pulled 10 tranches of articles from all theKCNA articles published from 1997 to 2013 We then labelled five ofthese tranches lsquothreatrsquo while labelling the other five as lsquonon-threatrsquo All

2 We tried a variety of machine learning algorithms such as Random Forests SupportVector Machines and Conditional Inference Trees Also we cross-validated their resultsOur findings demonstrate that these algorithms produced similar results Moreover all ofthese algorithms identified the similar pattern-detecting terms for classifying the KCNA ar-ticles In this article we reported the results based on the Conditional Inference Tree algo-rithm which ran the tree-structured regression models through binary recursivepartitioning in a conditional inference framework For more details on the ConditionalInference Tree algorithm please see the R Package lsquoPartyrsquo for Conditional Inference Trees(lsquoctreersquo) at httpcranr-projectorgwebpackagespartypartypdf

Detecting patterns in North Korean military provocations 9

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

the articles published within a week preceding each attack is labelled aslsquothreatrsquo and the rest of KCNA articles were treated as lsquonon-threatrsquo Intotal there were 1624 KCNA articles published during this periodwith 487 labelled as lsquothreatrsquo and 1137 labelled as lsquonon-threatrsquoFor the threat articles we extracted all the articles published within aweek of each North Korean military provocation (Fig 2) For the non-threat articles we selected tranches of news articles for a randomlychosen 10-day period from at least two months before or after a NorthKorean attack In so doing our goal was to capture articles that wereunrelated to a North Korean attack thus labelled as lsquonon-threatrsquo

Second the selected dataset is then lsquopreprocessedrsquo Since the originalarticles are not ready for an automated text analysis a cleaning processcalled lsquopreprocessingrsquo is necessary to prepare them for machine-learning Pre-processing is the standard procedure before the applica-tion of an automated text analysis Following the standard preprocess-ing procedure we remove all numbers and stopwords (eg a the into etc)3 We then transformed the body of the text into a documentterm matrix that was composed of rows of news articles followed bycolumns of terms The document term matrix turned preprocessed texts

Non-Threat

-2 +2

Acl

All Threacollected eading up

All Nowere ptwo mo

Threat

Provocation

at-labeledfrom the p to a pro

on-Threatpulled twoonths afte

d articles wseven day

ovocation

t-labeled o months er a provo

ere ays

articles before or

ocation

Non-

r

Threat

Time

Figure 2 Labelling threat and non-threat articles

3 In this article we use one of the most popular preprocessing methods known as the lsquobag ofwordsrsquo concept (Jurafsky and Martin 2009) It discards the use of word order as a factorand removes the so-called lsquostopwordsrsquo from the data The stopwords include punctuationcapitalization common words (eg at an the etc) and numbers For a full list of thestopwords used in our research please visit httpjmlrcsailmitedupapersvolume5lewis04aa11-smart-stop-listenglishstop Although preprocessing reduces the amount of infor-mation it has been shown by researchers that a simplification of text via preprocessing issufficient to infer valuable models (Hopkins and King 2010)

10 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

into quantifiable values (eg term counts and frequencies) for each cor-responding article

Finally once we had collected and preprocessed the KCNA data wesplit the whole data into two subsets a training dataset and a testdataset From the complete set of articles from both threat and non-threat tranches (1624 articles in total) we randomly selected 70(1137 articles) for the training dataset (Fig 3) The labels lsquothreatrsquo orlsquonon-threatrsquo for each article were included in the training dataset forthe purpose of automated machine-learning that is to develop amodel that can select pattern-detecting features based on a priori clas-sifications Basically the key pattern was frequency of occurrence Thefrequency with which certain words and short phrases appeared inthreat or non-threat articles determined how they were selected andweighted as pattern-detecting features in the model The trainedmodel was then applied to the remaining test dataset which was usedsolely for testing the accuracy of our model Importantly we removedall the threat and non-threat labels from each article in the testdataset The purpose for doing so was to challenge our model to usewhat it had learned (from the training dataset) about significant fea-tures (ie a term frequency rate) to accurately classify KCNA articles(in the test dataset) as either a threat or a non-threat without relyingon labels4

All Data 1624 Articles

1137 Articles 487

Training DataThreat Labels Provided

Test DataThreat Labels Removed

Figure 3 Training and test data sets

4 For replication our data is available at httpsdataverseharvardedudatasetxhtmlpersistentIdfrac14doi3A1079102FDVN2FB8CWWD

Detecting patterns in North Korean military provocations 11

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

2 Results

21 Training data results

Figure 4 shows the model we obtained from the training dataset usinga supervised machine-learning The key terms distinguishing NorthKorean threats from a peaceful period were lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquolsquoJunersquo and lsquoJapanesersquo When certain conditions were satisfied (as ex-plained below) the appearance of these terms in KCNA articles couldplay the role of a pattern-indicator that a North Korean military threatwas on the horizon

According to our model the first and strongest indicator of a NorthKorean military provocation is the appearance of the term lsquoyearsrsquo Inthe training dataset all KCNA articles in which the word lsquoyearsrsquo ap-peared at least once (53 articles in total) were published within a weekof a North Korean military strike From our viewpoint the most reli-able sign of an imminent North Korean military provocation is a sud-den increase in the frequency of the term lsquoyearsrsquo in KCNA articles

When the term lsquoyearsrsquo is absent the second best indicator of a NorthKorean military attack is the word lsquosignedrsquo Approximately 85 ofKCNA articles (51 articles in total) that had the term lsquosignedrsquo withoutthe word lsquobakrsquo (from former South Korean president Lee Myung-bak)

Figure 4 First conditional inference tree

12 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

were published within a week of a North Korea military provocationAs a result a sudden spike in the use of the word lsquosignedrsquo (withoutlsquobakrsquo) indicates that a North Korean military provocation is around thecorner In this respect however there is a further condition that mustbe satisfied When lsquosignedrsquo and lsquobakrsquo appear together in the same arti-cle they indicate a pattern of a peaceful situation instead As a resultthe pattern-detecting role of the term lsquosignedrsquo appears to be contingentupon the presence or absence of another term lsquobakrsquo

According to our model the third index indicating that Pyongyangmay be preparing for a military operation is the term lsquoassemblyrsquo Infact all the KCNA articles that included the word lsquoassemblyrsquo withoutlsquoyearsrsquo or lsquosignedrsquo (22 articles in the training dataset) were publishedwithin one week of a North Korean military attack As a result a sud-den increase in the frequency of the word lsquoassemblyrsquo may be a goodpattern indicator that a North Korean military strike is on the horizoneven in the absence of stronger threat terms such as lsquoyearsrsquo andlsquosignedrsquo

The fourth indicator of a North Korean military threat is the termlsquoJunersquo Unlike other indicators however the role of lsquoJunersquo is condi-tional In the absence of other stronger signs (ie years signed and as-sembly) the term lsquoJunersquo can play the role of a significant indicator of aNorth Korean military attack only if another term lsquoreunificationrsquo ap-pears once or less in the same article To our surprise however theterm lsquoJunersquo turns into a strong indicator of a non-threat situation if theword lsquoreunificationrsquo appears more than twice in the same article As aresult it seems that the word lsquoJunersquo implies an increasing NorthKorean threat only if it is not closely associated with another key pat-tern indicator lsquoreunificationrsquo

The last index of a North Korean military provocation is the wordlsquoJapanesersquo When other (and stronger) terms such as lsquoyearsrsquo lsquosignedrsquolsquoassemblyrsquo and lsquoJunersquo were not present all the KCNA articles that in-cluded the term lsquoJapanesersquo (17 articles in the training dataset) appearedwithin one week of a North Korean military provocation As a resultthe use of the term lsquoJapanesersquo serves as a good pattern indication thatin the absence of other attack words Pyongyang is moving dangerouslyclose to a military strike

Detecting patterns in North Korean military provocations 13

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

22 Testing the model

There are 904 KCNA articles that do not include any of the indicatorsdiscussed above Because they do not contain any pattern indicator ofa North Korean attack our model identifies them as non-threats Inreality however about 20 of these articles were published within oneweek of the five military provocations by Pyongyang As a result ourmodel has roughly 80 accuracy in identifying North Korean militarythreats What is encouraging is that our model has identified five keypattern indicators of increasing North Korean threats When these keyterms are put together under certain conditions they correctly identify80 of articles in the training dataset as threat articles or non-threatitems Put differently the model can accurately classify in 8 of 10 caseswhether various messages from Pyongyang are real threats or justrhetoric

Although 80 accuracy is impressive it is too early to be optimisticAfter all 80 accuracy was achieved with the training dataset fromwhich our model was originally derived If we were impressed by its80 accuracy rate we would resemble a case study specialist who de-veloped an elaborate theory from a few cases and then applied it backto those original cases only to be impressed by how accurate hishertheory was The real test consists of applying our model to cases otherthan those from which it is derived This is the reason why we dividedthe entire KCNA data into two subsets the training dataset fromwhich our model is developed and the test dataset to which it will beapplied as a real test

Table 1 shows the result of our model against the test datasetNumbers in the cells indicate whether there was agreement or discrep-ancy between model predictions and actual cases For example our

Table 1 Model accuracy against test dataset

ActualModel Threat Non-threat

Threat 74 13 Positive predictive value 085

Non-threat 73 327 Negative predictive value 082

Sensitivity Specificity Overall accuracy

050 096 082

14 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

model classified 327 cases (the lower right cell) as non-threat and thosearticles were in fact published during non-threat weeks In additionthe model classified 74 articles (the upper left cell) as threats and thosearticles were actually published during threat weeks When we applyour model to the test dataset its overall accuracy turns out 823roughly the same level we obtained with the training dataset Althoughour model is deduced from the training dataset it does not lose cate-gorizing capacity when applied to the test dataset Specifically of 487articles in the test dataset our model correctly classified 401 articles(823) as either threats or non-threats The model is very effective inidentifying harmless rhetoric from Pyongyang that is messages not re-sulting in actual military attacks When applied to a total of 340 non-threat articles in the test dataset our model successfully classified 327items as noise with only 13 misses (ie 967 accuracy) As a result itis very effective at filtering noise from Pyongyang By contrast thepattern-detecting accuracy of our model drops somewhat with respectto threats Of 147 threat articles in the test dataset it correctly identi-fies 74 items as threats while misrecognizing 73 threats as peaceful arti-cles (ie 50 accuracy)

3 Discussion

What does the model tell us in plain terms In particular what are themeanings of the key indicators for a North Korean military threat (ieyears signed assembly June and Japanese) To answer it is necessaryto go back to the original KCNA documents in order to understandthe contexts in which these terms were used A close reading of theKCNA articles that included the five key indicators or lsquoattack wordsrsquoreveals several interesting patterns

First the North Korean regime tends to emphasize its history ofmilitary struggle against foreign enemies before it launches armed prov-ocations It is then understandable that key terms such as lsquoyearsrsquo andlsquoJapanesersquo often appear in KCNA articles immediately before a militarystrike Prior to the second naval clash of Yonpyong on 29 June 2002for instance the North Korean government published a series of arti-cles in which it boasted of the lsquoyearsrsquo of North Korearsquos victoriousstruggles against foreign enemies including lsquoJapanesersquo colonialism(1910ndash45) It is possible that Pyongyang invoked the military legacy of

Detecting patterns in North Korean military provocations 15

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

its supreme leader Kim Il-sung in anticipation of an impending con-flict either to boost the morale of its people or to show the outsideworld that it was determined not to back down

Second the most perplexing term in Figure 4 is lsquoJunersquo because itcan be indicative of both threat and peace Specifically the word lsquoJunersquois an indicator of peace when associated with lsquoreunificationrsquo but it canalso be a sign of a military threat when it does not appear in conjunc-tion with lsquoreunificationrsquo A careful reading of the KCNA articles thatinclude the term lsquoJunersquo suggests an explanation for this strange phe-nomenon It turns out that the word lsquoJunersquo can refer to either theKorean War (which broke out on 25 June 1950) or the historic summitbetween Kim Dae-jung and Kim Il-sung on 15 June 2000 (which waspraised as a major cornerstone for future lsquoreunificationrsquo in the officialNorth Korean press) As a result when the word lsquoJunersquo appears inclose association with lsquoreunificationrsquo it refers to the 2000 summit (orlsquothe June 15 Summitrsquo as it is called in North Korea) thus indicatingthe peaceful intentions of Pyongyang By contrast when the wordlsquoJunersquo appears alone without connection to lsquoreunificationrsquo it refers tothe Korean War thus conveying a more hostile mood

Third it is also interesting to note that the North Korean govern-ment tends to publish many lsquosignedrsquo commentaries from RodongSinmun in the KCNA immediately before it launches military provoca-tions In these lsquosignedrsquo commentaries of Rodong Sinmun Pyongyangusually criticizes foreign countries (especially the United States Japanand South Korea) in very harsh terms on various issues Accordinglythe third term lsquosignedrsquo seems to correspond to the fact that RodongSinmun the official mouthpiece of the North Korean regime barksvery loudly before it actually bites its intended target A sudden in-crease in the number of lsquosignedrsquo commentaries of Rodong Sinmun inKCNA articles appears to be a reliable sign that the North Koreangovernment may be turning to an attack mode

Finally the last indicator of an imminent threat lsquoassemblyrsquo is theeasiest term to identify but the hardest one to interpret As shown inFigure 5 the term lsquoassemblyrsquo is primarily used in reference to the SPABecause the SPA is the highest North Korean authority with regard toits relations with foreign countries it is tempting to interpret a suddenincrease in references to the SPA prior to military provocations as indi-cating that Pyongyang is attempting to openly signal its hostile

16 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

intentions to the outside world In other words belligerent messagesemanating from the SPA (and reported in KCNA articles) may revealthe determination of the North Korean regime

Although plausible the problem with such an interpretation is thata close reading of several KCNA articles using the term lsquoassemblyrsquoshows that messages from the SPA are often not dire at all For exam-ple consider the following article published on 17 November 2010 justdays before the North Korean shelling at Yonpyong Island lsquoKim YongNam President of the Presidium of the DPRK SPA sent a message ofgreetings to Qaboos Bin Said Sultan of Oman on Wednesday on theoccasion of its national holidayrsquo (KCNA 17 November 2010) As thisexample illustrates a typical KCNA article with the word lsquoassemblyrsquodescribes rather routine business of the SPA such as how it sent a mes-sage of congratulations to a foreign country how it greeted visitingdignitaries from foreign countries and so on Clearly such mundanemessages cannot be a meaningful sign of an imminent North Koreanmilitary provocation While the publication of these articles (containingthe term lsquoassemblyrsquo) might possibly correspond to an uptick inPyongyangrsquos diplomatic efforts to strengthen its ties with foreign coun-tries and increase international support prior to a military attack weneed further analyses to substantiate such a hypothesis At the sametime it also seems clear from the test that the term lsquoassemblyrsquo does not

Figure 5 Social network analysis for key words in KCNA articles

Detecting patterns in North Korean military provocations 17

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

appear randomly and is somehow linked to the timing of NorthKorean military provocations Further research is necessary to make aproper interpretation of this seemingly irrelevant yet apparently signifi-cant indicator of North Korean military threats

31 Robustness checks

Before we conclude it is necessary to address five issues regarding therobustness of our findings (i) the rare-event issue (that is due to therare nature of a North Korean military attack our sampling ratio of710 should be justified) (ii) an out-of-sample alternative (that is in-stead of building our model by using 70 of data from the five NorthKorean military strikes and then applying it to the remaining 30 ofthe data it may be better to develop a model from the first threestrikes and then apply it to the remaining two North Korean provoca-tions) and (iii) a North Korean leadership change effect (that is in-stead of elaborating a single model covering 1997 to 2013 it may makesense to develop two separate models [one for the Kim Jong-il periodand the other for the Kim Jong-un era] in order to check whether thereis a leadership change effect) (iv) a South Korean leadership changeeffect (that is it may be better to elaborate two different models [onefor the lsquoconservativersquo period during Lee Myung-bak and Park Geun-hye and the other for the lsquoprogressiversquo era during Kim Dae-jung andRoh Moo-hyun] in order to investigate whether North Korea re-sponded differently to leadership changes in South Korea and (v) ano-Cheonan alternative (that is it is worth elaborating a model whileexcluding the Cheonan naval ship case because unlike the remainingfour attacks Pyongyang has denied its involvement in sinking theCheonan In this section these five issues are addressed to check therobustness of our model

First there is a discrepancy in the ratio of threat days vs non-threatdays between the data used in our analysis and the actual frequencyAs explained earlier we have defined the threat articles as those pub-lished one week prior to each actual crisis whereas the non-threatitems are defined as those chosen randomly for a 10-day period at leasttwo months before or after actual provocations For all five crises theratio of our data is then 3550 (in days) or 710 In reality however thenumber of peace days (ie 18 years minus 35 days) is much larger than

18 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

the number of crisis days (35 days) When we count all of the peacedays that are not included in our data (ie when we take all TrueNegatives into our data) the North Korean provocations become ex-tremely rare events As a result a possible criticism is that our ap-proach does not represent the true data generating process in a sam-pling procedure While military provocations occur rarely in realityour sampling procedure exaggerates their likelihood by significantly re-ducing the number of peace days in the data Put more technically ourmodel reduces the number of False Positives while increasing the num-ber of False Negatives

Despite the criticism our sampling strategy can be justified for tworeasons First the rare-event problem is not new in political scienceFor example King and Zeng (2001ab) addressed the same problem instatistical analysis (eg logit analysis with rare events) A commonlysuggested solution is the lsquochoice-based or endogenous stratified sam-plingrsquo in econometrics and the lsquocase-control designrsquo in epidemiology(Breslow 1996) The idea is to lsquoselect within categories of Yrsquo such thatall crisis days are sampled while we use a small random fraction ofnon-crisis days In this respect our sampling strategy is consistent withsuch an approach in that we have chosen all crisis days (35) and a ran-dom fraction of peace days (50) Not surprisingly such an approach iscommonly used in supervised machine-learning as well The rare-eventissue also known as a class imbalance problem is an ongoing issue insupervised machine-learning because the size of one class is often muchlarger than that of the other class (eg premature births violent civilconflicts fraudulent credit card transactions etc) In supervisedmachine-learning two solutions have been suggested a data samplingtechnique and an algorithmic modification technique Whereas datasampling is to under-sample the majority class while over-sampling theminority class a modification technique is to combine both over- andunder-sampling methods at the same time In this article we haveadopted a data sampling technique in order to address the class imbal-ance between threat days (a minority class) and non-threat days (a ma-jority class) in North Korean military provocations

Second although reducing the number of non-crisis days in ourdata (under-sampling) while using all the crisis days (over-sampling) isconsistent with both statistical and machine-learning literature therestill remains a question How far should we reduce the number of

Detecting patterns in North Korean military provocations 19

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

peace days Will a ratio of 12 13 or an even more skewed ratio gen-erate a better performance than our model using a 710 ratio To an-swer this question we pushed the ratio until we saw a clear sign of themodel becoming too skewed towards the majority class (peace days)In our case it occurred when the ratio between the two classes ex-ceeded 13 On this subject literature on machine-learning clearlyshows that the ratio of crisis and non-crisis should not be significantlyunbalanced According to Ertekin (2013) lsquo[i]n problems where theprevalence of classes is imbalanced it is necessary to prevent the resul-tant model from being skewed towards the majority (negative) classand to ensure that the model is capable of reflecting the true nature ofthe minority (positive) classrsquo Otherwise the generated model can sufferfrom an over-fitting problem For instance in the face of an extremelyskewed dataset (eg the real ratio of our data would be 35 crisis daysvs 6535 peace days) an automated machine-learning process naturallybecomes lsquogreedyrsquo that is in order to increase its overall accuracy it in-creasingly focuses on the larger category (6535 days) while virtually ig-noring the smaller category (35 days) Generating the model in thisway would be problematic however because our object is to focus onthe minority category of crisis days which is less than 05 of all daysfrom 1997 to 2013 After all we are trying to classify upcoming NorthKorean attacks not peaceful days

With this goal in mind we ran robustness checks to see if alternativemodels yield a better explanatory power when we increased the numberof peace days vis-a-vis threat days Because the maximum limit of anunbalanced ratio for automated machine-learning is 13 we attemptedboth 12 and 13 in our tests As Table 2 shows it is clear that the orig-inal model outperforms both alternatives Compared to the 12 ratiothe original model has more explanatory power in every way (iehigher overall accuracy higher threat accuracy higher non-threat accu-racy) By contrast the 13 ratio model has a slightly lower overall accu-racy marginally higher non-threat accuracy but devastatingly lowerthreat accuracy (175) than our original model because the increasingimbalance (from 710 to 13) turned the automated machine-learningprocess to a lsquogreedyrsquo mode As a result it is our conclusion that theoriginal model uses the golden ratio with the most accurate categoriz-ing capacity

20 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Second it is also necessary to check whether an out-of-sample ap-proach is a better alternative to our original model In our analysis wedeveloped a model by using 70 of data from the five North Koreanattacks and then applied it to the remaining 30 of data from Onemay suspect however that a better approach is to elaborate a modelfrom the first three North Korean strikes (ie during the Kim Jong-ilperiod) and then apply it to the remaining two attacks (ie during theKim Jong-un era) in order to see whether it yields more explanatorypower As Table 2 shows the opposite turns out to be the case In factthe out-of-sampling model loses explanatory power in every possibleway (ie lower overall accuracy lower threat accuracy and lower non-threat accuracy) As a result our original model outperforms the out-of-sampling alternative

Third it is necessary to check whether or not there is a leadershipchange effect in North Korea Has the change of power from KimJong-il to Kim Jong-un produced such significant differences in theirleadership style that it may be worth developing two separate models(one for the Kim Jong-il period the other for the Kim Jong-un era)instead of a single model covering the entire period as we did As

Table 2 Comparison with alternative models

ModelsOverall accuracy (N)

Threat accuracy(Sensitivity)

Non-threataccuracy (Specificity)

Rare event I 724 328 823

12 Ratio (368508) (66201) (302367)

Rare event II 788 175 99

13 Ratio (656832) (36206) (620626)

Out of sampling 541 145 821

KJI KJU (6471195) (72495) (575700)

NK leadership KJI 685 (98143) 459 (2861) 793 (6582)

KJI vs KJU KJU 595 (195328) 612 (93152) 58 (102176)

SK leadership Con 614 (221360) 363 (58160) 815 (163200)

Con vs Prog Prog 681 (98144) 493 (3571) 863 (6373)

No cheonan 669 514 787

Case (273408) 92178 181230

Our Model 823 503 961

(401487) (74147) (327340)

Detecting patterns in North Korean military provocations 21

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Table 2 shows the double platform alternative yields a worse outcomeoverall In fact its only advantage is that the Kim-Jong-un-model hasslightly higher threat accuracy (612) than our original model(503) In every other aspect however the Kim-Jong-un-model per-forms much worse Moreover the Kim-Jong-il-model has a much loweraccuracy in all respects than our original model As a result it is clearthat the double platform is a worse alternative to our single platformThe recent leadership change in Pyongyang has shown little effects asfar as its military provocations are concerned

Fourth it is important to investigate whether a leadership change inSouth Korea has any impact on North Korean military provocationsHas the oscillation of power between the lsquoprogressiversquo administrations(Kim Dae-jung and Roh Moo-hyun) and the lsquoconservativersquo regimes(Lee Myung-bak and Park Geun-hye) invoked different responses fromPyongyang If so we should expect a double platform (one model forthe conservative period vs the other model for the progressive era inSouth Korea) to outperform the original single platform As Table 2shows however the two separate models in the double-platform alter-native perform much worse than the original single platform in all cat-egories an overall accuracy threat accuracy and non-threat accuracyClearly the original model is a better choice It is shown above thatthe leadership change in Pyongyang has not produced significant policyshifts in its military provocations Likewise leadership changes inSouth Korea do not seem to have invoked major policy shifts inPyongyang as far as its military provocations are concerned

Finally it is worth examining an alternative model that excludes thesinking of the Cheonan naval ship case Unlike the remaining four at-tacks in our dataset the North Korean government has consistentlydenied its involvement in the Cheonan incident If Pyongyang had notreally sunk the Cheonan as it claimed it means that our original modelis performing below its full potential because it is built upon some er-roneous cases (ie the Cheonan incident) In that case we should ex-pect the no-Cheonan alternative to outperform our original model AsTable 2 shows however when we exclude all the data related to theCheonan case the resulting model loses much explanatory power com-pared to the original model Only in one category (a threat accuracy)the no-Cheonan-case alternative outperforms our original model buteven in that category the difference is ignorable (503 vs 514) In

22 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

all other categories the original model shows much stronger perfor-mances 823 vs 669 in an overall accuracy and 961 vs 787 ina non-threat accuracy Although Pyongyang has denied its involvementin the Cheonan incident the huge performance gap between the origi-nal model and the no-Cheonan alternative suggests otherwise

A different way of comparing performances of alternative models isshown in Figure 6 where the Receiver Operating Characteristic (ROC)is presented A ROC space is defined by a False Positive Rate (1 ndashSpecificity) in the x-axis and a True Positive Rate (Sensitivity) in the y-axis In a ROC space a theoretically perfect model (ie a combinationof 100 True Positive Rate with 0 False Positive Rate) appears in theupper left corner with the coordinates (0 1) In comparison a purelyrandom model such as a coin toss is found along a 45-degree diagonalline As a result good models (ie those performing better than ran-dom guesses) are found above the 45 degree line (especially close to they-axis) whereas bad models are located below it In Figure 6 the

Figure 6 Receiver operating characteristic (ROC) plot

Detecting patterns in North Korean military provocations 23

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

coordinates of our model (004 05) means that it has a False PositiveRate of 4 while its True Positive Rate is 50 By contrast the coordi-nates of alternative models in the ROC space are as follows (i) theRare Event I model with 12 ratio is (018 033) (ii) the Rare Event IImodel with 13 ratio is (001 018) (iii) the Out of Sampling (KJI KJU) model is (018 015) (iv) the Leadership Change (KJI-only)model is (021 046) (v) the Leadership Change (KJU-only) model is(042 061) (vi) the model for Conservative South Korean leadershipis (018 036) (vii) the model for Conservative South Korean leader-ship is (014 049) and (viii) the model that excludes the Cheonan caseis (021 051) As one can see all eight alternatives yield their coordi-nates either close to or lower than the diagonal line in the ROC spacewhereas our model lies above the diagonal line and close to the y-axisin the ROC space As a result we can conclude that the original modelperforms better than its potential rivals

4 Conclusion

In this article we studied articles published by the KCNA the officialnews outlet of North Korea in order to analyze patterns of its conven-tional military provocations To this end we have adopted a newmethod of automated text classification through supervised machinelearning Our model investigated the frequency of all terms appearingin KCNA articles immediately prior to five North Korean military at-tacks between 1997 and 2013 The frequency of these terms was thencompared with the frequency of terms appearing in KCNA articlespublished during peacetime without military provocations The com-parison brought to light a number of key terms ndash lsquoattack wordsrsquo so tospeak ndash whose appearances spiked in the KCNA prior to NorthKorean attacks Based on these terms our model correctly identifieseight of 10 articles as signs of imminent attacks or as peacetime newspieces

Specifically our model found five pattern-detecting terms of NorthKorean military threats lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquo lsquoJunersquo andlsquoJapanesersquo For a proper analysis of their meaning we went back to thearticles in which these five attack words appeared in order to investi-gate the contexts in which they were used by the North Korean gov-ernment Our investigation shows that in the lead-up to an attack

24 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Pyongyang displayed a strong tendency to emphasize the legacy of itsmilitary struggle against lsquoJapanesersquo colonialism and its fight against theUS imperialism during the Korean War (which began in lsquoJunersquo 1950)perhaps in an effort to heighten a domestic patriotic fervor at a timeof impending crisis In addition immediately before Pyongyang em-barks on hostile provocations the KCNA tends to increase its reprint-ing of lsquosignedrsquo commentaries from Rodong Sinmun which typically crit-icizes the United States South Korea and Japan in harsh termsprobably reflecting its deteriorating relations with the outside world

It seems that our methodology is promising to studies of interna-tional conflicts of various sorts First as shown in this article amachine-learning technology is useful in terms of detecting patternsthat distinguish threats of Pyongyang from its lsquonoisesrsquo or lsquobluffingrsquoSecond for other authoritarian regimes where there is a tight govern-ment controls over the mass media our approach can be used to detectpatterns symptoms or clues to escalating crisis Finally even for dem-ocratic countries with a free media our approach can be used on vari-ous occasions For instance a machine-learning technology can be uti-lized to analyze certain aspects of domestic politics such as signalingor communication between policymakers and the public To cite a con-crete example we can use a machine-learning technique to examinehow an aggressive foreign policy like lsquosabre rattlingrsquo affects public per-ceptions of fear or crisis in democracy Also a machine-learning ap-proach can be used to analyze external interactions of a democratic re-gime with other countries For instance we can test if there are anysignificant correlations between presidential elections in the US and thelevel of North Korean threats or between North Korean nuclear testsand varying public reactions in South Korea

As a final note it is not our contention that the model developed inthis article based on an automated machine-learning technique has de-tected intentional signals which Pyongyang sends to the outside worldimmediately prior to its military attack Instead the key attack wordswe have identified should be seen as signs or patterns that the NorthKorean regime unwittingly displays when it is inching toward a militaryoption For two reasons we doubt an intentional or signaling natureof the attack words in our model First if Pyongyang has indeed usedthe five attack words as a deliberate signal to the outside world that itis gearing up for military provocations why would it send the signal in

Detecting patterns in North Korean military provocations 25

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

such an arcane way that can be detected only through a complicatedautomated text classification process If the North Korean regime in-tends to send a signal to the outside world there is a better clearerand more visible option such as an Official Announcement by theMinistry of Foreign Affairs which Pyongyang has occasionally pub-lished in the pages of the KCNA on important issues Second if it hadbeen sending intentional signals of an impending military strike whywould the North Korean regime deny such an attack afterwards If re-peated an ex post denial would only reduce the credibility of an exante signal creating an image of North Korea as a casual bluffer Fortheir arcane nature and occasional ex post denial the five attack wordsin our model should not be treated as a premeditated signal fromPyongyang Instead they should be understood as inadvertent signs orpatterns unconsciously displayed by the North Korean governmentprior to a military attack In fact it is the unplanned nature of theseattack words that provides even more valuable information to the out-side world because it eliminates the possibility of feigned or false sig-nals from North Korea Like a pitcher who unknowingly flinches be-fore he throws a fast ball Pyongyang may unwittingly display certainpatterns before it launches a military strike

Acknowledgements

The publication of this article is supported by the National ResearchFoundation of Korea Grant funded by the Korean Government (NRF-2014S1A5A2A03065042 and NRF-2013S1A3A2055081) Previous ver-sions of this article were presented at the 2014 International StudiesAssociation Annual Convention (Toronto Canada) For questions re-garding the article please contact H Joo

References

Breslow NE (1196) lsquoStatistics in epidemiology the case-control studyrsquoJournal of American Statistical Association 91 433 14ndash28

Cha V (2010) The Sinking of the Cheonan Center for Strategic ampInternational Studies (2010 April 22) httpcsisorgpublicationsinking-cheonan (24 July 2014 date last accessed)

26 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Choe S-H (2009) Korean Navies Skirmish in DisputedWaters New YorkTimes (2009 November 10) httpwwwnytimescom20091111worldasia11koreahtml_rfrac140 (23 June 2014 date last accessed)

Ertekin S (2013) lsquoAdaptive oversampling for imbalanced data classificationrsquoin G Erol amp L Ricard (ed) Information Sciences and Systems pp261ndash269 New York Springer

Global Security (2002) The naval clash on the yellow sea on 29 June 2002 be-tween South and North Korea the situation and ROKrsquos position (2002July 1) Global Security httpwwwglobalsecurityorgwmdlibrarynewsdprk2002dprk-020701-1htm (22 July 2014 date last accessed)

Hong S (2012) lsquoNorth Korearsquos capability to conduct provocations and ROK-US capability to counter themrsquo Korea Association of Defense IndustryStudies 19 2 135ndash136

Hopkins D and King G (2010) lsquoExtracting systemic social science meaningfrom textrsquo American Journal of Political Science 54 1 229ndash47

Jung S-C (2013) Internal Instability and External Provocation Seoul KINU

Jung S-Y (2008) lsquoA study on the origin of the pueblo incident on 1968rsquoKukjejongchiyonrsquogu 11 2 179ndash207

Jurafsky D and Martin J (2009) Speech and Natural Language ProcessingAn Introduction to Natural Language Processing Computational Linguisticsand Speech Recognition NJ Prentice Hall

Kang C-G (2013) lsquoA laboratory for recursive partyrsquo Hangukcontentshakhoe11 4 23ndash32

King G and Zeng L (2001a) lsquoLogistic regression in rare events datarsquoPolitical Analysis 9 2 137ndash163

King G and Zeng L (2001b) lsquoExplaining rare events in InternationalRelationsrsquo International Organization 55 3 693ndash715

Ko M-K (2015) lsquoNorth Korean military adventurism in the late 1960s andthe changes of party-military relationsrsquo Hyondaibukhanyongu 18 3 7ndash58

Lee W-K (2014) lsquoA case study on the provocations by NKrsquo Kunsaji 91 663ndash110

Litwak R (2007) Regime Change Washington DC Johns HopkinsUniversity Press

Macfie N (2013) The Battles of the Korean West Sea (2010 November 29)Reutershttpwwwreuterscomarticle20101129us-korea-north-clashes-idUSTRE6AS1AL20101129 (5 August 2014 date last accessed)

Mearsheimer J J (2001) The Tragedy of Great Power Politics New YorkWW Norton amp Company

Moore M and Hutchison P (2010) Yeonpyeong Island A History(2010November 23) The Telegraph httpwwwtelegraphcouknewsworldnews

Detecting patterns in North Korean military provocations 27

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

asiasouthkorea8155486Yeonpyeong-Island-A-historyhtml (7 August 2014date last accessed)

Oh I-W (2011) lsquoSouth Korearsquos countermeasures against North KorearsquosArmed Provocation and Offensive dialogue proposalrsquo Trsquoongiljyollyak 111 227ndash266

Rich T (2012a) lsquoLike father like son Correlates of leaderhip in NorthKorearsquos english language newsrsquo Korea Observer 43 4 649ndash674

Rich T (2012b) lsquoDeciphering North Korearsquos nuclear rhetoric an automatedcontent analysis of KCNA newsrsquo Asian Affairs An American Review 3973ndash89

Sigal L (1998) Disarming Strangers Nuclear Diplomacy with North KoreaPrinceton Princeton University Press

Sohn J-Y (2002) South North Korea clash at sea CNN (2002 June 29)httpeditioncnncom2002WORLDasiapcfeast0629koreawarships (21July 2014 date last accessed)

Sudworth J (2010) How South Korean Ship was Sunk BBC News (2010May 20) httpwwwbbccouknews10130909 (4 August 2014 date lastaccessed)

28 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

  • lcw016-FN1
  • lcw016-FN2
  • lcw016-FN3
  • lcw016-FN4
Page 4: Detecting patterns in North Korean military provocations ...idse.or.kr/file/Whang_1.pdf · that North Korea may have because of increasing power asymmetry since the collapse of the

11 Logistics data and cases

111 Data KCNA

As the sole news agency of North Korea the KCNA provides daily re-ports of North Korean newspapers (eg Rodong Sinmun of the rulingWPK and Minju Chosun of the North Korean government) televisionbroadcasts (eg Korean Central Television) and radio broadcasts (egthe Korean Central Broadcasting System) Since 1996 the KCNA haspublished daily news articles via its server in Japan (wwwkcnacojp)On a typical day the KCNA publishes 20ndash40 articles including reportson activities of the ruling North Korean elite official statements of theNorth Korean government (eg an official statement from the Ministryof Foreign Affairs on nuclear issues) several articles selected fromRodong Sinmun and Minju Chosun miscellaneous news about NorthKorean society and reports of recent developments in foreigncountries

Scholars working on North Korea have relied on its two mediasRodong Sinmun and the KCNA As the newspaper of the WPKRodong Sinmun is regarded as the official mouthpiece of the NorthKorean regime As a result it has become popular for scholars espe-cially in South Korea to conduct content analyses of Rodong Sinmunin order to identify trends or policy shifts of the North Korean govern-ment (Koh 2013) From our viewpoint however Rodong Sinmun hastwo weaknesses First it is inappropriate for our project because its pri-mary target is the domestic audience of North Korea (thus publishedonly in Korean) Given that Rodong Sinmun is written for a domesticaudience it is not a proper place to look for signs patterns or indica-tors of Pyongyang that its relations with the outside world (especiallythe US and South Korea) are at a breaking point and that a militaryconflict of some sort is about to occur Instead the KCNA with its fo-cus on foreign audiences (thus published in four different languages ndashEnglish Spanish Russian and Korean) provides a better source to de-tect such signs or patterns Second the KCNA provides a better data-set from a technical viewpoint Although North Korea has madeRodong Sinmun available on internet (wwwkcnacojptoday-rodongrodonghtm) in recent years only few selected articles after 2002 are

4 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

available while only titles are provided for the rest of Rodong Sinmunarticles By contrast all the articles in the KCNA after 1996 are avail-able on the internet thus providing better data for our machine-basedtext-classification analysis to detect any signs patterns or indicatorsfrom Pyongyang that a military strike is likely to occur As a resultthe KCNA is used as the main source of data

Although our case selection of conventional North Korean provoca-tions begins in 1997 because KCNA data is available after that year itis more than a technical convenience to use the year 1997 as a startingpoint It also overlaps with the succession process from Kim Il-sung toKim Jong-il When Kim Il-sung passed away on 8 July 1994 KimJong-il took the traditional three-year mourning period (1994ndash96) be-fore he assumed the official title of General Secretary of WPK in 1997to rule the country As a result the year 1997 serves as a starting pointin our project not only for a technical reason (ie the availability of theKCNA dataset after 1996) but also for a substantive reason (ie it in-cluded North Korean military provocations in the post-Kim Il-sungera) In particular Pyongyang has launched five conventional militarystrikes in the post-Kim Il-sung period

12 Cases five conventional military crises since 1997

Figure 1 shows five North Korean conventional military attacks be-tween 1996 and 2013 (i) the First Battle of Yonpyong on 15 June1999 (ii) the Second Battle of Yonpyong on 29 June 2002 (iii) theBattle of Daechong on 10 November 2009 (iv) the sinking of theCheonan naval ship on 26 March 2010 and(v) the shelling atYonpyong Island on 23 November 2010 For our case selection threecriteria are used First each of these attacks caused one or more casu-alties on at least one side Second all five attacks used conventionalnon-nuclear weapons1 Third all five attacks were initiated by North

1 In this article we have excluded all events associated with the North Korean nuclear crisisIn a separate project however we employ a machine-learning technique to develop to de-tect significant signs or patterns of Pyongyang that it is about to conduct a nuclear testPreliminary research shows an interesting contrast between conventional provocations andnuclear crisis in the North Korean case As will be shown a single platform (covering theentire period from Kim Jong-Il to the Kim Jong-Un) outperforms a double platform (onemodel for the KJI period and another model for the KJU era) in the case of North Koreanconventional provocations Simply put there has not been a major policy shift in

Detecting patterns in North Korean military provocations 5

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Korea Below is a brief description of the five North Korean militaryprovocations

121 The first battle of Yonpyong15June1999

In 1999 Pyongyang claimed that South Korea trespassed the NorthernLimit Line (NLL) in the Yellow Sea On 7 June 1999 when NorthKorean patrol ships and fishing boats crossed the NLL the SouthKorean navy responded by increasing its patrol After a few days ofcontinual trespassing the South Korean government issued a warningand deployed two patrol corvettes (Hong 2012) When Pyongyangcontinued to ignore warnings the South Korean navy blocked NorthKorean boats ramming them (Macfie 2013) After a few skirmishesthe main battle occurred on 15 June 1999 when four North Korean pa-trol ships trespassed across the NLL soon joined by three NorthKorean torpedo boats and three additional patrol ships With a totalof ten battleships North Koreans launched a 25 mm cannon shell

Figure 1 Five conventional North Korean military attacks

Pyongyang as far as its conventional provocations are concerned By contrast a doubleplatform outperforms a single platform in the case of the North Korean nuclear crisis indi-cating a major policy shift from Kim Jong-Il to Kim Jong-Un At the moment we are try-ing to find out whether such a shift indicates an increasing intention of the new NorthKorean leadership under Kim Jong-Un to proclaim a lsquonuclear powerrsquo status by developingits nuclear program further instead of negotiating over it as his father had done on severaloccasions

6 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

(Park 2009) In response the South Korean navy fired with 40 mmand 76 mm machine guns When the battle was over a North Koreantorpedo boat sank a large patrol ship crashed and four patrol shipssustained damage In the process approximately 30 North Korean sol-diers were killed and 70 were wounded As for South Korea four pa-trol killers and one patrol corvette were damaged with nine soldierswounded (Moore and Hutchison 2010)

122 The second battle of Yonpyong 29 June 2002

By 2002 North Korean ships frequently crossed the NLL only to bechased back by South Korean patrol vessels On 29 June 2002 twoNorth Korean patrol ships crossed the border ignoring warnings fromSouth Korean navy speedboats At 1025 am the two North Koreanpatrol boats attacked nearby South Korean vessels with its 85 mmgun 25 mm auxiliary gun and hand carried rockets In response theSouth Korean patrol ships returned fire (Sohn 2002) A 20-minute bat-tle resulted in the death of four South Korean marines one missingand 18 wounded On the North Korean side approximately 30 sailorswere killed or injured While South Korean vessels were partly dam-aged one of the North Korean vessels was towed away in flames(Global Security 2002)

123 The battle of daechong 10 November 2009

On 10 November 2009 a North Korean patrol vessel trespassed theNLL Soon after two 130-ton South Korean vessels issued warningsbut the North Korean vessel ignored them When the South Koreanships fired warning shots the North Korean vessel began firing leadingto a 2-minute battle near Daechong Island located 18 miles off theNorth Korean coast The North Korean patrol vessel also attacked aSouth Korean high-speed patrol vessel (Kim 2009) In response theSouth Korean vessel countered with approximately 200 shots Whenthe battle was over there were no South Korean casualties but NorthKorea suffered one casualty and three injuries with its naval vessellsquowrapped in flamesrsquo (Choe 2009)

Detecting patterns in North Korean military provocations 7

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

124 Sinking of the cheonan naval ship 26 March 2010

On 26 March 2010 the South Korean naval ship Cheonan sank intothe Yellow Sea At 922 pm an explosion occurred in the 1200-tonwarship that was sailing by Baengnyong Island where the two Koreashad clashed numerous times before (Cha 2010) In the end 46 liveswere lost and it was quickly suspected that the ship lsquohad been hit byan external lsquonon-contactrsquo explosionrsquo with North Korea as the primesuspect (Sudworth 2010) Pyongyang denied its involvement andclaimed that the incident had been contrived by South Korea A six-week-long investigation by international experts proved the involve-ment of North Korea in the attack

125 Shelling of Yonpyong island 23 November 2010

On 23 November 2010 North Korea fired artillery shells at YonpyongIsland near the NLL About 200 shells hit the island and set fire todozens of buildings The barrage killed two South Korean citizens andtwo marines while injuring three civilians and 17 soldiers The attackbegan when South Korea was practicing military drills near the NLLdespite North Korean warnings North Korea fired three separate bar-rages with dozens of artillery shells in each barrage In return SouthKorea responded by firing 80 rounds from K9 155 mm self-propelledHowitzers (Kim and Kim 2011) When the battle was over more than50 civilian homes were in flames

13 Research method supervised machine learning

Although the North Korean nuclear crisis has been the focus of the in-ternational community in recent years the history of North Koreanmilitary provocations dates much further back For instancePyongyang destabilized an already precarious security environment inthe Korean Peninsula by launching a series of provocations such asthe hijacking of a South Korean airline (1958) the Korean DMZConflict known as lsquothe Second Korean Warrsquo with more than 700 casu-alties (1969) the hijacking of the USS Pueblo (1968) the notoriousAxe Murder Incident (1976) the bombing of the Korean Air Line 858(1987) and so on

8 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Not surprisingly many scholars have attempted to analyze NorthKorean military provocations trying to identify their lsquocausesrsquo (Jung2013 Lee 2014 Ko 2015) main lsquogoalsrsquo in those attacks (Jung 2008Kang 2013) proper lsquopoliciesrsquo to curtail further threats (Oh 2011Kim 2012) and so on Despite sincere efforts earlier studies sufferedfrom one notorious problem a lack of reliable data As one put itNorth Korea has been lsquothe blackest of black holesrsquo (Litwak 2007289) Under such circumstances scholars had little choice but to makeeducated guesses ndash lsquoguesstimationsrsquo ndash regarding unknown intentionsgoals or likely moves of the North Korean regime As a result the ex-isting literature on North Korean military provocations has beendriven less by reliable data but more by subjective interpretations

By contrast we rely on the official North Korean media KCNA Asfor the method we use a supervised machine-learning technology tomaximize the use of the KCNA dataset that covers various aspects ofNorth Korea from 1997 to 2013 The main advantage of our methodcomes from the quality of the data that is our findings are data-driven and thus more objective than previous works relying on subjec-tive interpretation of selected observations Given the paucity ofreliable information on North Korea it is important to analyze thecontent of official KCNA articles that are publicly available A super-vised machine-learning technology is the optimal method to processsuch data objectively

Supervised machine-learning consists of three steps(i) data collec-tion and document labelling (ii) pre-processing and (iii) model extrac-tion and analysis2 First we obtained our dataset from the KCNAwebsite (httpwwwkcnacojp) Since there are five North Korean mili-tary attacks in our case we pulled 10 tranches of articles from all theKCNA articles published from 1997 to 2013 We then labelled five ofthese tranches lsquothreatrsquo while labelling the other five as lsquonon-threatrsquo All

2 We tried a variety of machine learning algorithms such as Random Forests SupportVector Machines and Conditional Inference Trees Also we cross-validated their resultsOur findings demonstrate that these algorithms produced similar results Moreover all ofthese algorithms identified the similar pattern-detecting terms for classifying the KCNA ar-ticles In this article we reported the results based on the Conditional Inference Tree algo-rithm which ran the tree-structured regression models through binary recursivepartitioning in a conditional inference framework For more details on the ConditionalInference Tree algorithm please see the R Package lsquoPartyrsquo for Conditional Inference Trees(lsquoctreersquo) at httpcranr-projectorgwebpackagespartypartypdf

Detecting patterns in North Korean military provocations 9

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

the articles published within a week preceding each attack is labelled aslsquothreatrsquo and the rest of KCNA articles were treated as lsquonon-threatrsquo Intotal there were 1624 KCNA articles published during this periodwith 487 labelled as lsquothreatrsquo and 1137 labelled as lsquonon-threatrsquoFor the threat articles we extracted all the articles published within aweek of each North Korean military provocation (Fig 2) For the non-threat articles we selected tranches of news articles for a randomlychosen 10-day period from at least two months before or after a NorthKorean attack In so doing our goal was to capture articles that wereunrelated to a North Korean attack thus labelled as lsquonon-threatrsquo

Second the selected dataset is then lsquopreprocessedrsquo Since the originalarticles are not ready for an automated text analysis a cleaning processcalled lsquopreprocessingrsquo is necessary to prepare them for machine-learning Pre-processing is the standard procedure before the applica-tion of an automated text analysis Following the standard preprocess-ing procedure we remove all numbers and stopwords (eg a the into etc)3 We then transformed the body of the text into a documentterm matrix that was composed of rows of news articles followed bycolumns of terms The document term matrix turned preprocessed texts

Non-Threat

-2 +2

Acl

All Threacollected eading up

All Nowere ptwo mo

Threat

Provocation

at-labeledfrom the p to a pro

on-Threatpulled twoonths afte

d articles wseven day

ovocation

t-labeled o months er a provo

ere ays

articles before or

ocation

Non-

r

Threat

Time

Figure 2 Labelling threat and non-threat articles

3 In this article we use one of the most popular preprocessing methods known as the lsquobag ofwordsrsquo concept (Jurafsky and Martin 2009) It discards the use of word order as a factorand removes the so-called lsquostopwordsrsquo from the data The stopwords include punctuationcapitalization common words (eg at an the etc) and numbers For a full list of thestopwords used in our research please visit httpjmlrcsailmitedupapersvolume5lewis04aa11-smart-stop-listenglishstop Although preprocessing reduces the amount of infor-mation it has been shown by researchers that a simplification of text via preprocessing issufficient to infer valuable models (Hopkins and King 2010)

10 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

into quantifiable values (eg term counts and frequencies) for each cor-responding article

Finally once we had collected and preprocessed the KCNA data wesplit the whole data into two subsets a training dataset and a testdataset From the complete set of articles from both threat and non-threat tranches (1624 articles in total) we randomly selected 70(1137 articles) for the training dataset (Fig 3) The labels lsquothreatrsquo orlsquonon-threatrsquo for each article were included in the training dataset forthe purpose of automated machine-learning that is to develop amodel that can select pattern-detecting features based on a priori clas-sifications Basically the key pattern was frequency of occurrence Thefrequency with which certain words and short phrases appeared inthreat or non-threat articles determined how they were selected andweighted as pattern-detecting features in the model The trainedmodel was then applied to the remaining test dataset which was usedsolely for testing the accuracy of our model Importantly we removedall the threat and non-threat labels from each article in the testdataset The purpose for doing so was to challenge our model to usewhat it had learned (from the training dataset) about significant fea-tures (ie a term frequency rate) to accurately classify KCNA articles(in the test dataset) as either a threat or a non-threat without relyingon labels4

All Data 1624 Articles

1137 Articles 487

Training DataThreat Labels Provided

Test DataThreat Labels Removed

Figure 3 Training and test data sets

4 For replication our data is available at httpsdataverseharvardedudatasetxhtmlpersistentIdfrac14doi3A1079102FDVN2FB8CWWD

Detecting patterns in North Korean military provocations 11

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

2 Results

21 Training data results

Figure 4 shows the model we obtained from the training dataset usinga supervised machine-learning The key terms distinguishing NorthKorean threats from a peaceful period were lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquolsquoJunersquo and lsquoJapanesersquo When certain conditions were satisfied (as ex-plained below) the appearance of these terms in KCNA articles couldplay the role of a pattern-indicator that a North Korean military threatwas on the horizon

According to our model the first and strongest indicator of a NorthKorean military provocation is the appearance of the term lsquoyearsrsquo Inthe training dataset all KCNA articles in which the word lsquoyearsrsquo ap-peared at least once (53 articles in total) were published within a weekof a North Korean military strike From our viewpoint the most reli-able sign of an imminent North Korean military provocation is a sud-den increase in the frequency of the term lsquoyearsrsquo in KCNA articles

When the term lsquoyearsrsquo is absent the second best indicator of a NorthKorean military attack is the word lsquosignedrsquo Approximately 85 ofKCNA articles (51 articles in total) that had the term lsquosignedrsquo withoutthe word lsquobakrsquo (from former South Korean president Lee Myung-bak)

Figure 4 First conditional inference tree

12 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

were published within a week of a North Korea military provocationAs a result a sudden spike in the use of the word lsquosignedrsquo (withoutlsquobakrsquo) indicates that a North Korean military provocation is around thecorner In this respect however there is a further condition that mustbe satisfied When lsquosignedrsquo and lsquobakrsquo appear together in the same arti-cle they indicate a pattern of a peaceful situation instead As a resultthe pattern-detecting role of the term lsquosignedrsquo appears to be contingentupon the presence or absence of another term lsquobakrsquo

According to our model the third index indicating that Pyongyangmay be preparing for a military operation is the term lsquoassemblyrsquo Infact all the KCNA articles that included the word lsquoassemblyrsquo withoutlsquoyearsrsquo or lsquosignedrsquo (22 articles in the training dataset) were publishedwithin one week of a North Korean military attack As a result a sud-den increase in the frequency of the word lsquoassemblyrsquo may be a goodpattern indicator that a North Korean military strike is on the horizoneven in the absence of stronger threat terms such as lsquoyearsrsquo andlsquosignedrsquo

The fourth indicator of a North Korean military threat is the termlsquoJunersquo Unlike other indicators however the role of lsquoJunersquo is condi-tional In the absence of other stronger signs (ie years signed and as-sembly) the term lsquoJunersquo can play the role of a significant indicator of aNorth Korean military attack only if another term lsquoreunificationrsquo ap-pears once or less in the same article To our surprise however theterm lsquoJunersquo turns into a strong indicator of a non-threat situation if theword lsquoreunificationrsquo appears more than twice in the same article As aresult it seems that the word lsquoJunersquo implies an increasing NorthKorean threat only if it is not closely associated with another key pat-tern indicator lsquoreunificationrsquo

The last index of a North Korean military provocation is the wordlsquoJapanesersquo When other (and stronger) terms such as lsquoyearsrsquo lsquosignedrsquolsquoassemblyrsquo and lsquoJunersquo were not present all the KCNA articles that in-cluded the term lsquoJapanesersquo (17 articles in the training dataset) appearedwithin one week of a North Korean military provocation As a resultthe use of the term lsquoJapanesersquo serves as a good pattern indication thatin the absence of other attack words Pyongyang is moving dangerouslyclose to a military strike

Detecting patterns in North Korean military provocations 13

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

22 Testing the model

There are 904 KCNA articles that do not include any of the indicatorsdiscussed above Because they do not contain any pattern indicator ofa North Korean attack our model identifies them as non-threats Inreality however about 20 of these articles were published within oneweek of the five military provocations by Pyongyang As a result ourmodel has roughly 80 accuracy in identifying North Korean militarythreats What is encouraging is that our model has identified five keypattern indicators of increasing North Korean threats When these keyterms are put together under certain conditions they correctly identify80 of articles in the training dataset as threat articles or non-threatitems Put differently the model can accurately classify in 8 of 10 caseswhether various messages from Pyongyang are real threats or justrhetoric

Although 80 accuracy is impressive it is too early to be optimisticAfter all 80 accuracy was achieved with the training dataset fromwhich our model was originally derived If we were impressed by its80 accuracy rate we would resemble a case study specialist who de-veloped an elaborate theory from a few cases and then applied it backto those original cases only to be impressed by how accurate hishertheory was The real test consists of applying our model to cases otherthan those from which it is derived This is the reason why we dividedthe entire KCNA data into two subsets the training dataset fromwhich our model is developed and the test dataset to which it will beapplied as a real test

Table 1 shows the result of our model against the test datasetNumbers in the cells indicate whether there was agreement or discrep-ancy between model predictions and actual cases For example our

Table 1 Model accuracy against test dataset

ActualModel Threat Non-threat

Threat 74 13 Positive predictive value 085

Non-threat 73 327 Negative predictive value 082

Sensitivity Specificity Overall accuracy

050 096 082

14 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

model classified 327 cases (the lower right cell) as non-threat and thosearticles were in fact published during non-threat weeks In additionthe model classified 74 articles (the upper left cell) as threats and thosearticles were actually published during threat weeks When we applyour model to the test dataset its overall accuracy turns out 823roughly the same level we obtained with the training dataset Althoughour model is deduced from the training dataset it does not lose cate-gorizing capacity when applied to the test dataset Specifically of 487articles in the test dataset our model correctly classified 401 articles(823) as either threats or non-threats The model is very effective inidentifying harmless rhetoric from Pyongyang that is messages not re-sulting in actual military attacks When applied to a total of 340 non-threat articles in the test dataset our model successfully classified 327items as noise with only 13 misses (ie 967 accuracy) As a result itis very effective at filtering noise from Pyongyang By contrast thepattern-detecting accuracy of our model drops somewhat with respectto threats Of 147 threat articles in the test dataset it correctly identi-fies 74 items as threats while misrecognizing 73 threats as peaceful arti-cles (ie 50 accuracy)

3 Discussion

What does the model tell us in plain terms In particular what are themeanings of the key indicators for a North Korean military threat (ieyears signed assembly June and Japanese) To answer it is necessaryto go back to the original KCNA documents in order to understandthe contexts in which these terms were used A close reading of theKCNA articles that included the five key indicators or lsquoattack wordsrsquoreveals several interesting patterns

First the North Korean regime tends to emphasize its history ofmilitary struggle against foreign enemies before it launches armed prov-ocations It is then understandable that key terms such as lsquoyearsrsquo andlsquoJapanesersquo often appear in KCNA articles immediately before a militarystrike Prior to the second naval clash of Yonpyong on 29 June 2002for instance the North Korean government published a series of arti-cles in which it boasted of the lsquoyearsrsquo of North Korearsquos victoriousstruggles against foreign enemies including lsquoJapanesersquo colonialism(1910ndash45) It is possible that Pyongyang invoked the military legacy of

Detecting patterns in North Korean military provocations 15

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

its supreme leader Kim Il-sung in anticipation of an impending con-flict either to boost the morale of its people or to show the outsideworld that it was determined not to back down

Second the most perplexing term in Figure 4 is lsquoJunersquo because itcan be indicative of both threat and peace Specifically the word lsquoJunersquois an indicator of peace when associated with lsquoreunificationrsquo but it canalso be a sign of a military threat when it does not appear in conjunc-tion with lsquoreunificationrsquo A careful reading of the KCNA articles thatinclude the term lsquoJunersquo suggests an explanation for this strange phe-nomenon It turns out that the word lsquoJunersquo can refer to either theKorean War (which broke out on 25 June 1950) or the historic summitbetween Kim Dae-jung and Kim Il-sung on 15 June 2000 (which waspraised as a major cornerstone for future lsquoreunificationrsquo in the officialNorth Korean press) As a result when the word lsquoJunersquo appears inclose association with lsquoreunificationrsquo it refers to the 2000 summit (orlsquothe June 15 Summitrsquo as it is called in North Korea) thus indicatingthe peaceful intentions of Pyongyang By contrast when the wordlsquoJunersquo appears alone without connection to lsquoreunificationrsquo it refers tothe Korean War thus conveying a more hostile mood

Third it is also interesting to note that the North Korean govern-ment tends to publish many lsquosignedrsquo commentaries from RodongSinmun in the KCNA immediately before it launches military provoca-tions In these lsquosignedrsquo commentaries of Rodong Sinmun Pyongyangusually criticizes foreign countries (especially the United States Japanand South Korea) in very harsh terms on various issues Accordinglythe third term lsquosignedrsquo seems to correspond to the fact that RodongSinmun the official mouthpiece of the North Korean regime barksvery loudly before it actually bites its intended target A sudden in-crease in the number of lsquosignedrsquo commentaries of Rodong Sinmun inKCNA articles appears to be a reliable sign that the North Koreangovernment may be turning to an attack mode

Finally the last indicator of an imminent threat lsquoassemblyrsquo is theeasiest term to identify but the hardest one to interpret As shown inFigure 5 the term lsquoassemblyrsquo is primarily used in reference to the SPABecause the SPA is the highest North Korean authority with regard toits relations with foreign countries it is tempting to interpret a suddenincrease in references to the SPA prior to military provocations as indi-cating that Pyongyang is attempting to openly signal its hostile

16 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

intentions to the outside world In other words belligerent messagesemanating from the SPA (and reported in KCNA articles) may revealthe determination of the North Korean regime

Although plausible the problem with such an interpretation is thata close reading of several KCNA articles using the term lsquoassemblyrsquoshows that messages from the SPA are often not dire at all For exam-ple consider the following article published on 17 November 2010 justdays before the North Korean shelling at Yonpyong Island lsquoKim YongNam President of the Presidium of the DPRK SPA sent a message ofgreetings to Qaboos Bin Said Sultan of Oman on Wednesday on theoccasion of its national holidayrsquo (KCNA 17 November 2010) As thisexample illustrates a typical KCNA article with the word lsquoassemblyrsquodescribes rather routine business of the SPA such as how it sent a mes-sage of congratulations to a foreign country how it greeted visitingdignitaries from foreign countries and so on Clearly such mundanemessages cannot be a meaningful sign of an imminent North Koreanmilitary provocation While the publication of these articles (containingthe term lsquoassemblyrsquo) might possibly correspond to an uptick inPyongyangrsquos diplomatic efforts to strengthen its ties with foreign coun-tries and increase international support prior to a military attack weneed further analyses to substantiate such a hypothesis At the sametime it also seems clear from the test that the term lsquoassemblyrsquo does not

Figure 5 Social network analysis for key words in KCNA articles

Detecting patterns in North Korean military provocations 17

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

appear randomly and is somehow linked to the timing of NorthKorean military provocations Further research is necessary to make aproper interpretation of this seemingly irrelevant yet apparently signifi-cant indicator of North Korean military threats

31 Robustness checks

Before we conclude it is necessary to address five issues regarding therobustness of our findings (i) the rare-event issue (that is due to therare nature of a North Korean military attack our sampling ratio of710 should be justified) (ii) an out-of-sample alternative (that is in-stead of building our model by using 70 of data from the five NorthKorean military strikes and then applying it to the remaining 30 ofthe data it may be better to develop a model from the first threestrikes and then apply it to the remaining two North Korean provoca-tions) and (iii) a North Korean leadership change effect (that is in-stead of elaborating a single model covering 1997 to 2013 it may makesense to develop two separate models [one for the Kim Jong-il periodand the other for the Kim Jong-un era] in order to check whether thereis a leadership change effect) (iv) a South Korean leadership changeeffect (that is it may be better to elaborate two different models [onefor the lsquoconservativersquo period during Lee Myung-bak and Park Geun-hye and the other for the lsquoprogressiversquo era during Kim Dae-jung andRoh Moo-hyun] in order to investigate whether North Korea re-sponded differently to leadership changes in South Korea and (v) ano-Cheonan alternative (that is it is worth elaborating a model whileexcluding the Cheonan naval ship case because unlike the remainingfour attacks Pyongyang has denied its involvement in sinking theCheonan In this section these five issues are addressed to check therobustness of our model

First there is a discrepancy in the ratio of threat days vs non-threatdays between the data used in our analysis and the actual frequencyAs explained earlier we have defined the threat articles as those pub-lished one week prior to each actual crisis whereas the non-threatitems are defined as those chosen randomly for a 10-day period at leasttwo months before or after actual provocations For all five crises theratio of our data is then 3550 (in days) or 710 In reality however thenumber of peace days (ie 18 years minus 35 days) is much larger than

18 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

the number of crisis days (35 days) When we count all of the peacedays that are not included in our data (ie when we take all TrueNegatives into our data) the North Korean provocations become ex-tremely rare events As a result a possible criticism is that our ap-proach does not represent the true data generating process in a sam-pling procedure While military provocations occur rarely in realityour sampling procedure exaggerates their likelihood by significantly re-ducing the number of peace days in the data Put more technically ourmodel reduces the number of False Positives while increasing the num-ber of False Negatives

Despite the criticism our sampling strategy can be justified for tworeasons First the rare-event problem is not new in political scienceFor example King and Zeng (2001ab) addressed the same problem instatistical analysis (eg logit analysis with rare events) A commonlysuggested solution is the lsquochoice-based or endogenous stratified sam-plingrsquo in econometrics and the lsquocase-control designrsquo in epidemiology(Breslow 1996) The idea is to lsquoselect within categories of Yrsquo such thatall crisis days are sampled while we use a small random fraction ofnon-crisis days In this respect our sampling strategy is consistent withsuch an approach in that we have chosen all crisis days (35) and a ran-dom fraction of peace days (50) Not surprisingly such an approach iscommonly used in supervised machine-learning as well The rare-eventissue also known as a class imbalance problem is an ongoing issue insupervised machine-learning because the size of one class is often muchlarger than that of the other class (eg premature births violent civilconflicts fraudulent credit card transactions etc) In supervisedmachine-learning two solutions have been suggested a data samplingtechnique and an algorithmic modification technique Whereas datasampling is to under-sample the majority class while over-sampling theminority class a modification technique is to combine both over- andunder-sampling methods at the same time In this article we haveadopted a data sampling technique in order to address the class imbal-ance between threat days (a minority class) and non-threat days (a ma-jority class) in North Korean military provocations

Second although reducing the number of non-crisis days in ourdata (under-sampling) while using all the crisis days (over-sampling) isconsistent with both statistical and machine-learning literature therestill remains a question How far should we reduce the number of

Detecting patterns in North Korean military provocations 19

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

peace days Will a ratio of 12 13 or an even more skewed ratio gen-erate a better performance than our model using a 710 ratio To an-swer this question we pushed the ratio until we saw a clear sign of themodel becoming too skewed towards the majority class (peace days)In our case it occurred when the ratio between the two classes ex-ceeded 13 On this subject literature on machine-learning clearlyshows that the ratio of crisis and non-crisis should not be significantlyunbalanced According to Ertekin (2013) lsquo[i]n problems where theprevalence of classes is imbalanced it is necessary to prevent the resul-tant model from being skewed towards the majority (negative) classand to ensure that the model is capable of reflecting the true nature ofthe minority (positive) classrsquo Otherwise the generated model can sufferfrom an over-fitting problem For instance in the face of an extremelyskewed dataset (eg the real ratio of our data would be 35 crisis daysvs 6535 peace days) an automated machine-learning process naturallybecomes lsquogreedyrsquo that is in order to increase its overall accuracy it in-creasingly focuses on the larger category (6535 days) while virtually ig-noring the smaller category (35 days) Generating the model in thisway would be problematic however because our object is to focus onthe minority category of crisis days which is less than 05 of all daysfrom 1997 to 2013 After all we are trying to classify upcoming NorthKorean attacks not peaceful days

With this goal in mind we ran robustness checks to see if alternativemodels yield a better explanatory power when we increased the numberof peace days vis-a-vis threat days Because the maximum limit of anunbalanced ratio for automated machine-learning is 13 we attemptedboth 12 and 13 in our tests As Table 2 shows it is clear that the orig-inal model outperforms both alternatives Compared to the 12 ratiothe original model has more explanatory power in every way (iehigher overall accuracy higher threat accuracy higher non-threat accu-racy) By contrast the 13 ratio model has a slightly lower overall accu-racy marginally higher non-threat accuracy but devastatingly lowerthreat accuracy (175) than our original model because the increasingimbalance (from 710 to 13) turned the automated machine-learningprocess to a lsquogreedyrsquo mode As a result it is our conclusion that theoriginal model uses the golden ratio with the most accurate categoriz-ing capacity

20 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Second it is also necessary to check whether an out-of-sample ap-proach is a better alternative to our original model In our analysis wedeveloped a model by using 70 of data from the five North Koreanattacks and then applied it to the remaining 30 of data from Onemay suspect however that a better approach is to elaborate a modelfrom the first three North Korean strikes (ie during the Kim Jong-ilperiod) and then apply it to the remaining two attacks (ie during theKim Jong-un era) in order to see whether it yields more explanatorypower As Table 2 shows the opposite turns out to be the case In factthe out-of-sampling model loses explanatory power in every possibleway (ie lower overall accuracy lower threat accuracy and lower non-threat accuracy) As a result our original model outperforms the out-of-sampling alternative

Third it is necessary to check whether or not there is a leadershipchange effect in North Korea Has the change of power from KimJong-il to Kim Jong-un produced such significant differences in theirleadership style that it may be worth developing two separate models(one for the Kim Jong-il period the other for the Kim Jong-un era)instead of a single model covering the entire period as we did As

Table 2 Comparison with alternative models

ModelsOverall accuracy (N)

Threat accuracy(Sensitivity)

Non-threataccuracy (Specificity)

Rare event I 724 328 823

12 Ratio (368508) (66201) (302367)

Rare event II 788 175 99

13 Ratio (656832) (36206) (620626)

Out of sampling 541 145 821

KJI KJU (6471195) (72495) (575700)

NK leadership KJI 685 (98143) 459 (2861) 793 (6582)

KJI vs KJU KJU 595 (195328) 612 (93152) 58 (102176)

SK leadership Con 614 (221360) 363 (58160) 815 (163200)

Con vs Prog Prog 681 (98144) 493 (3571) 863 (6373)

No cheonan 669 514 787

Case (273408) 92178 181230

Our Model 823 503 961

(401487) (74147) (327340)

Detecting patterns in North Korean military provocations 21

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Table 2 shows the double platform alternative yields a worse outcomeoverall In fact its only advantage is that the Kim-Jong-un-model hasslightly higher threat accuracy (612) than our original model(503) In every other aspect however the Kim-Jong-un-model per-forms much worse Moreover the Kim-Jong-il-model has a much loweraccuracy in all respects than our original model As a result it is clearthat the double platform is a worse alternative to our single platformThe recent leadership change in Pyongyang has shown little effects asfar as its military provocations are concerned

Fourth it is important to investigate whether a leadership change inSouth Korea has any impact on North Korean military provocationsHas the oscillation of power between the lsquoprogressiversquo administrations(Kim Dae-jung and Roh Moo-hyun) and the lsquoconservativersquo regimes(Lee Myung-bak and Park Geun-hye) invoked different responses fromPyongyang If so we should expect a double platform (one model forthe conservative period vs the other model for the progressive era inSouth Korea) to outperform the original single platform As Table 2shows however the two separate models in the double-platform alter-native perform much worse than the original single platform in all cat-egories an overall accuracy threat accuracy and non-threat accuracyClearly the original model is a better choice It is shown above thatthe leadership change in Pyongyang has not produced significant policyshifts in its military provocations Likewise leadership changes inSouth Korea do not seem to have invoked major policy shifts inPyongyang as far as its military provocations are concerned

Finally it is worth examining an alternative model that excludes thesinking of the Cheonan naval ship case Unlike the remaining four at-tacks in our dataset the North Korean government has consistentlydenied its involvement in the Cheonan incident If Pyongyang had notreally sunk the Cheonan as it claimed it means that our original modelis performing below its full potential because it is built upon some er-roneous cases (ie the Cheonan incident) In that case we should ex-pect the no-Cheonan alternative to outperform our original model AsTable 2 shows however when we exclude all the data related to theCheonan case the resulting model loses much explanatory power com-pared to the original model Only in one category (a threat accuracy)the no-Cheonan-case alternative outperforms our original model buteven in that category the difference is ignorable (503 vs 514) In

22 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

all other categories the original model shows much stronger perfor-mances 823 vs 669 in an overall accuracy and 961 vs 787 ina non-threat accuracy Although Pyongyang has denied its involvementin the Cheonan incident the huge performance gap between the origi-nal model and the no-Cheonan alternative suggests otherwise

A different way of comparing performances of alternative models isshown in Figure 6 where the Receiver Operating Characteristic (ROC)is presented A ROC space is defined by a False Positive Rate (1 ndashSpecificity) in the x-axis and a True Positive Rate (Sensitivity) in the y-axis In a ROC space a theoretically perfect model (ie a combinationof 100 True Positive Rate with 0 False Positive Rate) appears in theupper left corner with the coordinates (0 1) In comparison a purelyrandom model such as a coin toss is found along a 45-degree diagonalline As a result good models (ie those performing better than ran-dom guesses) are found above the 45 degree line (especially close to they-axis) whereas bad models are located below it In Figure 6 the

Figure 6 Receiver operating characteristic (ROC) plot

Detecting patterns in North Korean military provocations 23

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

coordinates of our model (004 05) means that it has a False PositiveRate of 4 while its True Positive Rate is 50 By contrast the coordi-nates of alternative models in the ROC space are as follows (i) theRare Event I model with 12 ratio is (018 033) (ii) the Rare Event IImodel with 13 ratio is (001 018) (iii) the Out of Sampling (KJI KJU) model is (018 015) (iv) the Leadership Change (KJI-only)model is (021 046) (v) the Leadership Change (KJU-only) model is(042 061) (vi) the model for Conservative South Korean leadershipis (018 036) (vii) the model for Conservative South Korean leader-ship is (014 049) and (viii) the model that excludes the Cheonan caseis (021 051) As one can see all eight alternatives yield their coordi-nates either close to or lower than the diagonal line in the ROC spacewhereas our model lies above the diagonal line and close to the y-axisin the ROC space As a result we can conclude that the original modelperforms better than its potential rivals

4 Conclusion

In this article we studied articles published by the KCNA the officialnews outlet of North Korea in order to analyze patterns of its conven-tional military provocations To this end we have adopted a newmethod of automated text classification through supervised machinelearning Our model investigated the frequency of all terms appearingin KCNA articles immediately prior to five North Korean military at-tacks between 1997 and 2013 The frequency of these terms was thencompared with the frequency of terms appearing in KCNA articlespublished during peacetime without military provocations The com-parison brought to light a number of key terms ndash lsquoattack wordsrsquo so tospeak ndash whose appearances spiked in the KCNA prior to NorthKorean attacks Based on these terms our model correctly identifieseight of 10 articles as signs of imminent attacks or as peacetime newspieces

Specifically our model found five pattern-detecting terms of NorthKorean military threats lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquo lsquoJunersquo andlsquoJapanesersquo For a proper analysis of their meaning we went back to thearticles in which these five attack words appeared in order to investi-gate the contexts in which they were used by the North Korean gov-ernment Our investigation shows that in the lead-up to an attack

24 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Pyongyang displayed a strong tendency to emphasize the legacy of itsmilitary struggle against lsquoJapanesersquo colonialism and its fight against theUS imperialism during the Korean War (which began in lsquoJunersquo 1950)perhaps in an effort to heighten a domestic patriotic fervor at a timeof impending crisis In addition immediately before Pyongyang em-barks on hostile provocations the KCNA tends to increase its reprint-ing of lsquosignedrsquo commentaries from Rodong Sinmun which typically crit-icizes the United States South Korea and Japan in harsh termsprobably reflecting its deteriorating relations with the outside world

It seems that our methodology is promising to studies of interna-tional conflicts of various sorts First as shown in this article amachine-learning technology is useful in terms of detecting patternsthat distinguish threats of Pyongyang from its lsquonoisesrsquo or lsquobluffingrsquoSecond for other authoritarian regimes where there is a tight govern-ment controls over the mass media our approach can be used to detectpatterns symptoms or clues to escalating crisis Finally even for dem-ocratic countries with a free media our approach can be used on vari-ous occasions For instance a machine-learning technology can be uti-lized to analyze certain aspects of domestic politics such as signalingor communication between policymakers and the public To cite a con-crete example we can use a machine-learning technique to examinehow an aggressive foreign policy like lsquosabre rattlingrsquo affects public per-ceptions of fear or crisis in democracy Also a machine-learning ap-proach can be used to analyze external interactions of a democratic re-gime with other countries For instance we can test if there are anysignificant correlations between presidential elections in the US and thelevel of North Korean threats or between North Korean nuclear testsand varying public reactions in South Korea

As a final note it is not our contention that the model developed inthis article based on an automated machine-learning technique has de-tected intentional signals which Pyongyang sends to the outside worldimmediately prior to its military attack Instead the key attack wordswe have identified should be seen as signs or patterns that the NorthKorean regime unwittingly displays when it is inching toward a militaryoption For two reasons we doubt an intentional or signaling natureof the attack words in our model First if Pyongyang has indeed usedthe five attack words as a deliberate signal to the outside world that itis gearing up for military provocations why would it send the signal in

Detecting patterns in North Korean military provocations 25

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

such an arcane way that can be detected only through a complicatedautomated text classification process If the North Korean regime in-tends to send a signal to the outside world there is a better clearerand more visible option such as an Official Announcement by theMinistry of Foreign Affairs which Pyongyang has occasionally pub-lished in the pages of the KCNA on important issues Second if it hadbeen sending intentional signals of an impending military strike whywould the North Korean regime deny such an attack afterwards If re-peated an ex post denial would only reduce the credibility of an exante signal creating an image of North Korea as a casual bluffer Fortheir arcane nature and occasional ex post denial the five attack wordsin our model should not be treated as a premeditated signal fromPyongyang Instead they should be understood as inadvertent signs orpatterns unconsciously displayed by the North Korean governmentprior to a military attack In fact it is the unplanned nature of theseattack words that provides even more valuable information to the out-side world because it eliminates the possibility of feigned or false sig-nals from North Korea Like a pitcher who unknowingly flinches be-fore he throws a fast ball Pyongyang may unwittingly display certainpatterns before it launches a military strike

Acknowledgements

The publication of this article is supported by the National ResearchFoundation of Korea Grant funded by the Korean Government (NRF-2014S1A5A2A03065042 and NRF-2013S1A3A2055081) Previous ver-sions of this article were presented at the 2014 International StudiesAssociation Annual Convention (Toronto Canada) For questions re-garding the article please contact H Joo

References

Breslow NE (1196) lsquoStatistics in epidemiology the case-control studyrsquoJournal of American Statistical Association 91 433 14ndash28

Cha V (2010) The Sinking of the Cheonan Center for Strategic ampInternational Studies (2010 April 22) httpcsisorgpublicationsinking-cheonan (24 July 2014 date last accessed)

26 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Choe S-H (2009) Korean Navies Skirmish in DisputedWaters New YorkTimes (2009 November 10) httpwwwnytimescom20091111worldasia11koreahtml_rfrac140 (23 June 2014 date last accessed)

Ertekin S (2013) lsquoAdaptive oversampling for imbalanced data classificationrsquoin G Erol amp L Ricard (ed) Information Sciences and Systems pp261ndash269 New York Springer

Global Security (2002) The naval clash on the yellow sea on 29 June 2002 be-tween South and North Korea the situation and ROKrsquos position (2002July 1) Global Security httpwwwglobalsecurityorgwmdlibrarynewsdprk2002dprk-020701-1htm (22 July 2014 date last accessed)

Hong S (2012) lsquoNorth Korearsquos capability to conduct provocations and ROK-US capability to counter themrsquo Korea Association of Defense IndustryStudies 19 2 135ndash136

Hopkins D and King G (2010) lsquoExtracting systemic social science meaningfrom textrsquo American Journal of Political Science 54 1 229ndash47

Jung S-C (2013) Internal Instability and External Provocation Seoul KINU

Jung S-Y (2008) lsquoA study on the origin of the pueblo incident on 1968rsquoKukjejongchiyonrsquogu 11 2 179ndash207

Jurafsky D and Martin J (2009) Speech and Natural Language ProcessingAn Introduction to Natural Language Processing Computational Linguisticsand Speech Recognition NJ Prentice Hall

Kang C-G (2013) lsquoA laboratory for recursive partyrsquo Hangukcontentshakhoe11 4 23ndash32

King G and Zeng L (2001a) lsquoLogistic regression in rare events datarsquoPolitical Analysis 9 2 137ndash163

King G and Zeng L (2001b) lsquoExplaining rare events in InternationalRelationsrsquo International Organization 55 3 693ndash715

Ko M-K (2015) lsquoNorth Korean military adventurism in the late 1960s andthe changes of party-military relationsrsquo Hyondaibukhanyongu 18 3 7ndash58

Lee W-K (2014) lsquoA case study on the provocations by NKrsquo Kunsaji 91 663ndash110

Litwak R (2007) Regime Change Washington DC Johns HopkinsUniversity Press

Macfie N (2013) The Battles of the Korean West Sea (2010 November 29)Reutershttpwwwreuterscomarticle20101129us-korea-north-clashes-idUSTRE6AS1AL20101129 (5 August 2014 date last accessed)

Mearsheimer J J (2001) The Tragedy of Great Power Politics New YorkWW Norton amp Company

Moore M and Hutchison P (2010) Yeonpyeong Island A History(2010November 23) The Telegraph httpwwwtelegraphcouknewsworldnews

Detecting patterns in North Korean military provocations 27

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

asiasouthkorea8155486Yeonpyeong-Island-A-historyhtml (7 August 2014date last accessed)

Oh I-W (2011) lsquoSouth Korearsquos countermeasures against North KorearsquosArmed Provocation and Offensive dialogue proposalrsquo Trsquoongiljyollyak 111 227ndash266

Rich T (2012a) lsquoLike father like son Correlates of leaderhip in NorthKorearsquos english language newsrsquo Korea Observer 43 4 649ndash674

Rich T (2012b) lsquoDeciphering North Korearsquos nuclear rhetoric an automatedcontent analysis of KCNA newsrsquo Asian Affairs An American Review 3973ndash89

Sigal L (1998) Disarming Strangers Nuclear Diplomacy with North KoreaPrinceton Princeton University Press

Sohn J-Y (2002) South North Korea clash at sea CNN (2002 June 29)httpeditioncnncom2002WORLDasiapcfeast0629koreawarships (21July 2014 date last accessed)

Sudworth J (2010) How South Korean Ship was Sunk BBC News (2010May 20) httpwwwbbccouknews10130909 (4 August 2014 date lastaccessed)

28 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

  • lcw016-FN1
  • lcw016-FN2
  • lcw016-FN3
  • lcw016-FN4
Page 5: Detecting patterns in North Korean military provocations ...idse.or.kr/file/Whang_1.pdf · that North Korea may have because of increasing power asymmetry since the collapse of the

available while only titles are provided for the rest of Rodong Sinmunarticles By contrast all the articles in the KCNA after 1996 are avail-able on the internet thus providing better data for our machine-basedtext-classification analysis to detect any signs patterns or indicatorsfrom Pyongyang that a military strike is likely to occur As a resultthe KCNA is used as the main source of data

Although our case selection of conventional North Korean provoca-tions begins in 1997 because KCNA data is available after that year itis more than a technical convenience to use the year 1997 as a startingpoint It also overlaps with the succession process from Kim Il-sung toKim Jong-il When Kim Il-sung passed away on 8 July 1994 KimJong-il took the traditional three-year mourning period (1994ndash96) be-fore he assumed the official title of General Secretary of WPK in 1997to rule the country As a result the year 1997 serves as a starting pointin our project not only for a technical reason (ie the availability of theKCNA dataset after 1996) but also for a substantive reason (ie it in-cluded North Korean military provocations in the post-Kim Il-sungera) In particular Pyongyang has launched five conventional militarystrikes in the post-Kim Il-sung period

12 Cases five conventional military crises since 1997

Figure 1 shows five North Korean conventional military attacks be-tween 1996 and 2013 (i) the First Battle of Yonpyong on 15 June1999 (ii) the Second Battle of Yonpyong on 29 June 2002 (iii) theBattle of Daechong on 10 November 2009 (iv) the sinking of theCheonan naval ship on 26 March 2010 and(v) the shelling atYonpyong Island on 23 November 2010 For our case selection threecriteria are used First each of these attacks caused one or more casu-alties on at least one side Second all five attacks used conventionalnon-nuclear weapons1 Third all five attacks were initiated by North

1 In this article we have excluded all events associated with the North Korean nuclear crisisIn a separate project however we employ a machine-learning technique to develop to de-tect significant signs or patterns of Pyongyang that it is about to conduct a nuclear testPreliminary research shows an interesting contrast between conventional provocations andnuclear crisis in the North Korean case As will be shown a single platform (covering theentire period from Kim Jong-Il to the Kim Jong-Un) outperforms a double platform (onemodel for the KJI period and another model for the KJU era) in the case of North Koreanconventional provocations Simply put there has not been a major policy shift in

Detecting patterns in North Korean military provocations 5

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Korea Below is a brief description of the five North Korean militaryprovocations

121 The first battle of Yonpyong15June1999

In 1999 Pyongyang claimed that South Korea trespassed the NorthernLimit Line (NLL) in the Yellow Sea On 7 June 1999 when NorthKorean patrol ships and fishing boats crossed the NLL the SouthKorean navy responded by increasing its patrol After a few days ofcontinual trespassing the South Korean government issued a warningand deployed two patrol corvettes (Hong 2012) When Pyongyangcontinued to ignore warnings the South Korean navy blocked NorthKorean boats ramming them (Macfie 2013) After a few skirmishesthe main battle occurred on 15 June 1999 when four North Korean pa-trol ships trespassed across the NLL soon joined by three NorthKorean torpedo boats and three additional patrol ships With a totalof ten battleships North Koreans launched a 25 mm cannon shell

Figure 1 Five conventional North Korean military attacks

Pyongyang as far as its conventional provocations are concerned By contrast a doubleplatform outperforms a single platform in the case of the North Korean nuclear crisis indi-cating a major policy shift from Kim Jong-Il to Kim Jong-Un At the moment we are try-ing to find out whether such a shift indicates an increasing intention of the new NorthKorean leadership under Kim Jong-Un to proclaim a lsquonuclear powerrsquo status by developingits nuclear program further instead of negotiating over it as his father had done on severaloccasions

6 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

(Park 2009) In response the South Korean navy fired with 40 mmand 76 mm machine guns When the battle was over a North Koreantorpedo boat sank a large patrol ship crashed and four patrol shipssustained damage In the process approximately 30 North Korean sol-diers were killed and 70 were wounded As for South Korea four pa-trol killers and one patrol corvette were damaged with nine soldierswounded (Moore and Hutchison 2010)

122 The second battle of Yonpyong 29 June 2002

By 2002 North Korean ships frequently crossed the NLL only to bechased back by South Korean patrol vessels On 29 June 2002 twoNorth Korean patrol ships crossed the border ignoring warnings fromSouth Korean navy speedboats At 1025 am the two North Koreanpatrol boats attacked nearby South Korean vessels with its 85 mmgun 25 mm auxiliary gun and hand carried rockets In response theSouth Korean patrol ships returned fire (Sohn 2002) A 20-minute bat-tle resulted in the death of four South Korean marines one missingand 18 wounded On the North Korean side approximately 30 sailorswere killed or injured While South Korean vessels were partly dam-aged one of the North Korean vessels was towed away in flames(Global Security 2002)

123 The battle of daechong 10 November 2009

On 10 November 2009 a North Korean patrol vessel trespassed theNLL Soon after two 130-ton South Korean vessels issued warningsbut the North Korean vessel ignored them When the South Koreanships fired warning shots the North Korean vessel began firing leadingto a 2-minute battle near Daechong Island located 18 miles off theNorth Korean coast The North Korean patrol vessel also attacked aSouth Korean high-speed patrol vessel (Kim 2009) In response theSouth Korean vessel countered with approximately 200 shots Whenthe battle was over there were no South Korean casualties but NorthKorea suffered one casualty and three injuries with its naval vessellsquowrapped in flamesrsquo (Choe 2009)

Detecting patterns in North Korean military provocations 7

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

124 Sinking of the cheonan naval ship 26 March 2010

On 26 March 2010 the South Korean naval ship Cheonan sank intothe Yellow Sea At 922 pm an explosion occurred in the 1200-tonwarship that was sailing by Baengnyong Island where the two Koreashad clashed numerous times before (Cha 2010) In the end 46 liveswere lost and it was quickly suspected that the ship lsquohad been hit byan external lsquonon-contactrsquo explosionrsquo with North Korea as the primesuspect (Sudworth 2010) Pyongyang denied its involvement andclaimed that the incident had been contrived by South Korea A six-week-long investigation by international experts proved the involve-ment of North Korea in the attack

125 Shelling of Yonpyong island 23 November 2010

On 23 November 2010 North Korea fired artillery shells at YonpyongIsland near the NLL About 200 shells hit the island and set fire todozens of buildings The barrage killed two South Korean citizens andtwo marines while injuring three civilians and 17 soldiers The attackbegan when South Korea was practicing military drills near the NLLdespite North Korean warnings North Korea fired three separate bar-rages with dozens of artillery shells in each barrage In return SouthKorea responded by firing 80 rounds from K9 155 mm self-propelledHowitzers (Kim and Kim 2011) When the battle was over more than50 civilian homes were in flames

13 Research method supervised machine learning

Although the North Korean nuclear crisis has been the focus of the in-ternational community in recent years the history of North Koreanmilitary provocations dates much further back For instancePyongyang destabilized an already precarious security environment inthe Korean Peninsula by launching a series of provocations such asthe hijacking of a South Korean airline (1958) the Korean DMZConflict known as lsquothe Second Korean Warrsquo with more than 700 casu-alties (1969) the hijacking of the USS Pueblo (1968) the notoriousAxe Murder Incident (1976) the bombing of the Korean Air Line 858(1987) and so on

8 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Not surprisingly many scholars have attempted to analyze NorthKorean military provocations trying to identify their lsquocausesrsquo (Jung2013 Lee 2014 Ko 2015) main lsquogoalsrsquo in those attacks (Jung 2008Kang 2013) proper lsquopoliciesrsquo to curtail further threats (Oh 2011Kim 2012) and so on Despite sincere efforts earlier studies sufferedfrom one notorious problem a lack of reliable data As one put itNorth Korea has been lsquothe blackest of black holesrsquo (Litwak 2007289) Under such circumstances scholars had little choice but to makeeducated guesses ndash lsquoguesstimationsrsquo ndash regarding unknown intentionsgoals or likely moves of the North Korean regime As a result the ex-isting literature on North Korean military provocations has beendriven less by reliable data but more by subjective interpretations

By contrast we rely on the official North Korean media KCNA Asfor the method we use a supervised machine-learning technology tomaximize the use of the KCNA dataset that covers various aspects ofNorth Korea from 1997 to 2013 The main advantage of our methodcomes from the quality of the data that is our findings are data-driven and thus more objective than previous works relying on subjec-tive interpretation of selected observations Given the paucity ofreliable information on North Korea it is important to analyze thecontent of official KCNA articles that are publicly available A super-vised machine-learning technology is the optimal method to processsuch data objectively

Supervised machine-learning consists of three steps(i) data collec-tion and document labelling (ii) pre-processing and (iii) model extrac-tion and analysis2 First we obtained our dataset from the KCNAwebsite (httpwwwkcnacojp) Since there are five North Korean mili-tary attacks in our case we pulled 10 tranches of articles from all theKCNA articles published from 1997 to 2013 We then labelled five ofthese tranches lsquothreatrsquo while labelling the other five as lsquonon-threatrsquo All

2 We tried a variety of machine learning algorithms such as Random Forests SupportVector Machines and Conditional Inference Trees Also we cross-validated their resultsOur findings demonstrate that these algorithms produced similar results Moreover all ofthese algorithms identified the similar pattern-detecting terms for classifying the KCNA ar-ticles In this article we reported the results based on the Conditional Inference Tree algo-rithm which ran the tree-structured regression models through binary recursivepartitioning in a conditional inference framework For more details on the ConditionalInference Tree algorithm please see the R Package lsquoPartyrsquo for Conditional Inference Trees(lsquoctreersquo) at httpcranr-projectorgwebpackagespartypartypdf

Detecting patterns in North Korean military provocations 9

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

the articles published within a week preceding each attack is labelled aslsquothreatrsquo and the rest of KCNA articles were treated as lsquonon-threatrsquo Intotal there were 1624 KCNA articles published during this periodwith 487 labelled as lsquothreatrsquo and 1137 labelled as lsquonon-threatrsquoFor the threat articles we extracted all the articles published within aweek of each North Korean military provocation (Fig 2) For the non-threat articles we selected tranches of news articles for a randomlychosen 10-day period from at least two months before or after a NorthKorean attack In so doing our goal was to capture articles that wereunrelated to a North Korean attack thus labelled as lsquonon-threatrsquo

Second the selected dataset is then lsquopreprocessedrsquo Since the originalarticles are not ready for an automated text analysis a cleaning processcalled lsquopreprocessingrsquo is necessary to prepare them for machine-learning Pre-processing is the standard procedure before the applica-tion of an automated text analysis Following the standard preprocess-ing procedure we remove all numbers and stopwords (eg a the into etc)3 We then transformed the body of the text into a documentterm matrix that was composed of rows of news articles followed bycolumns of terms The document term matrix turned preprocessed texts

Non-Threat

-2 +2

Acl

All Threacollected eading up

All Nowere ptwo mo

Threat

Provocation

at-labeledfrom the p to a pro

on-Threatpulled twoonths afte

d articles wseven day

ovocation

t-labeled o months er a provo

ere ays

articles before or

ocation

Non-

r

Threat

Time

Figure 2 Labelling threat and non-threat articles

3 In this article we use one of the most popular preprocessing methods known as the lsquobag ofwordsrsquo concept (Jurafsky and Martin 2009) It discards the use of word order as a factorand removes the so-called lsquostopwordsrsquo from the data The stopwords include punctuationcapitalization common words (eg at an the etc) and numbers For a full list of thestopwords used in our research please visit httpjmlrcsailmitedupapersvolume5lewis04aa11-smart-stop-listenglishstop Although preprocessing reduces the amount of infor-mation it has been shown by researchers that a simplification of text via preprocessing issufficient to infer valuable models (Hopkins and King 2010)

10 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

into quantifiable values (eg term counts and frequencies) for each cor-responding article

Finally once we had collected and preprocessed the KCNA data wesplit the whole data into two subsets a training dataset and a testdataset From the complete set of articles from both threat and non-threat tranches (1624 articles in total) we randomly selected 70(1137 articles) for the training dataset (Fig 3) The labels lsquothreatrsquo orlsquonon-threatrsquo for each article were included in the training dataset forthe purpose of automated machine-learning that is to develop amodel that can select pattern-detecting features based on a priori clas-sifications Basically the key pattern was frequency of occurrence Thefrequency with which certain words and short phrases appeared inthreat or non-threat articles determined how they were selected andweighted as pattern-detecting features in the model The trainedmodel was then applied to the remaining test dataset which was usedsolely for testing the accuracy of our model Importantly we removedall the threat and non-threat labels from each article in the testdataset The purpose for doing so was to challenge our model to usewhat it had learned (from the training dataset) about significant fea-tures (ie a term frequency rate) to accurately classify KCNA articles(in the test dataset) as either a threat or a non-threat without relyingon labels4

All Data 1624 Articles

1137 Articles 487

Training DataThreat Labels Provided

Test DataThreat Labels Removed

Figure 3 Training and test data sets

4 For replication our data is available at httpsdataverseharvardedudatasetxhtmlpersistentIdfrac14doi3A1079102FDVN2FB8CWWD

Detecting patterns in North Korean military provocations 11

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

2 Results

21 Training data results

Figure 4 shows the model we obtained from the training dataset usinga supervised machine-learning The key terms distinguishing NorthKorean threats from a peaceful period were lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquolsquoJunersquo and lsquoJapanesersquo When certain conditions were satisfied (as ex-plained below) the appearance of these terms in KCNA articles couldplay the role of a pattern-indicator that a North Korean military threatwas on the horizon

According to our model the first and strongest indicator of a NorthKorean military provocation is the appearance of the term lsquoyearsrsquo Inthe training dataset all KCNA articles in which the word lsquoyearsrsquo ap-peared at least once (53 articles in total) were published within a weekof a North Korean military strike From our viewpoint the most reli-able sign of an imminent North Korean military provocation is a sud-den increase in the frequency of the term lsquoyearsrsquo in KCNA articles

When the term lsquoyearsrsquo is absent the second best indicator of a NorthKorean military attack is the word lsquosignedrsquo Approximately 85 ofKCNA articles (51 articles in total) that had the term lsquosignedrsquo withoutthe word lsquobakrsquo (from former South Korean president Lee Myung-bak)

Figure 4 First conditional inference tree

12 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

were published within a week of a North Korea military provocationAs a result a sudden spike in the use of the word lsquosignedrsquo (withoutlsquobakrsquo) indicates that a North Korean military provocation is around thecorner In this respect however there is a further condition that mustbe satisfied When lsquosignedrsquo and lsquobakrsquo appear together in the same arti-cle they indicate a pattern of a peaceful situation instead As a resultthe pattern-detecting role of the term lsquosignedrsquo appears to be contingentupon the presence or absence of another term lsquobakrsquo

According to our model the third index indicating that Pyongyangmay be preparing for a military operation is the term lsquoassemblyrsquo Infact all the KCNA articles that included the word lsquoassemblyrsquo withoutlsquoyearsrsquo or lsquosignedrsquo (22 articles in the training dataset) were publishedwithin one week of a North Korean military attack As a result a sud-den increase in the frequency of the word lsquoassemblyrsquo may be a goodpattern indicator that a North Korean military strike is on the horizoneven in the absence of stronger threat terms such as lsquoyearsrsquo andlsquosignedrsquo

The fourth indicator of a North Korean military threat is the termlsquoJunersquo Unlike other indicators however the role of lsquoJunersquo is condi-tional In the absence of other stronger signs (ie years signed and as-sembly) the term lsquoJunersquo can play the role of a significant indicator of aNorth Korean military attack only if another term lsquoreunificationrsquo ap-pears once or less in the same article To our surprise however theterm lsquoJunersquo turns into a strong indicator of a non-threat situation if theword lsquoreunificationrsquo appears more than twice in the same article As aresult it seems that the word lsquoJunersquo implies an increasing NorthKorean threat only if it is not closely associated with another key pat-tern indicator lsquoreunificationrsquo

The last index of a North Korean military provocation is the wordlsquoJapanesersquo When other (and stronger) terms such as lsquoyearsrsquo lsquosignedrsquolsquoassemblyrsquo and lsquoJunersquo were not present all the KCNA articles that in-cluded the term lsquoJapanesersquo (17 articles in the training dataset) appearedwithin one week of a North Korean military provocation As a resultthe use of the term lsquoJapanesersquo serves as a good pattern indication thatin the absence of other attack words Pyongyang is moving dangerouslyclose to a military strike

Detecting patterns in North Korean military provocations 13

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

22 Testing the model

There are 904 KCNA articles that do not include any of the indicatorsdiscussed above Because they do not contain any pattern indicator ofa North Korean attack our model identifies them as non-threats Inreality however about 20 of these articles were published within oneweek of the five military provocations by Pyongyang As a result ourmodel has roughly 80 accuracy in identifying North Korean militarythreats What is encouraging is that our model has identified five keypattern indicators of increasing North Korean threats When these keyterms are put together under certain conditions they correctly identify80 of articles in the training dataset as threat articles or non-threatitems Put differently the model can accurately classify in 8 of 10 caseswhether various messages from Pyongyang are real threats or justrhetoric

Although 80 accuracy is impressive it is too early to be optimisticAfter all 80 accuracy was achieved with the training dataset fromwhich our model was originally derived If we were impressed by its80 accuracy rate we would resemble a case study specialist who de-veloped an elaborate theory from a few cases and then applied it backto those original cases only to be impressed by how accurate hishertheory was The real test consists of applying our model to cases otherthan those from which it is derived This is the reason why we dividedthe entire KCNA data into two subsets the training dataset fromwhich our model is developed and the test dataset to which it will beapplied as a real test

Table 1 shows the result of our model against the test datasetNumbers in the cells indicate whether there was agreement or discrep-ancy between model predictions and actual cases For example our

Table 1 Model accuracy against test dataset

ActualModel Threat Non-threat

Threat 74 13 Positive predictive value 085

Non-threat 73 327 Negative predictive value 082

Sensitivity Specificity Overall accuracy

050 096 082

14 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

model classified 327 cases (the lower right cell) as non-threat and thosearticles were in fact published during non-threat weeks In additionthe model classified 74 articles (the upper left cell) as threats and thosearticles were actually published during threat weeks When we applyour model to the test dataset its overall accuracy turns out 823roughly the same level we obtained with the training dataset Althoughour model is deduced from the training dataset it does not lose cate-gorizing capacity when applied to the test dataset Specifically of 487articles in the test dataset our model correctly classified 401 articles(823) as either threats or non-threats The model is very effective inidentifying harmless rhetoric from Pyongyang that is messages not re-sulting in actual military attacks When applied to a total of 340 non-threat articles in the test dataset our model successfully classified 327items as noise with only 13 misses (ie 967 accuracy) As a result itis very effective at filtering noise from Pyongyang By contrast thepattern-detecting accuracy of our model drops somewhat with respectto threats Of 147 threat articles in the test dataset it correctly identi-fies 74 items as threats while misrecognizing 73 threats as peaceful arti-cles (ie 50 accuracy)

3 Discussion

What does the model tell us in plain terms In particular what are themeanings of the key indicators for a North Korean military threat (ieyears signed assembly June and Japanese) To answer it is necessaryto go back to the original KCNA documents in order to understandthe contexts in which these terms were used A close reading of theKCNA articles that included the five key indicators or lsquoattack wordsrsquoreveals several interesting patterns

First the North Korean regime tends to emphasize its history ofmilitary struggle against foreign enemies before it launches armed prov-ocations It is then understandable that key terms such as lsquoyearsrsquo andlsquoJapanesersquo often appear in KCNA articles immediately before a militarystrike Prior to the second naval clash of Yonpyong on 29 June 2002for instance the North Korean government published a series of arti-cles in which it boasted of the lsquoyearsrsquo of North Korearsquos victoriousstruggles against foreign enemies including lsquoJapanesersquo colonialism(1910ndash45) It is possible that Pyongyang invoked the military legacy of

Detecting patterns in North Korean military provocations 15

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

its supreme leader Kim Il-sung in anticipation of an impending con-flict either to boost the morale of its people or to show the outsideworld that it was determined not to back down

Second the most perplexing term in Figure 4 is lsquoJunersquo because itcan be indicative of both threat and peace Specifically the word lsquoJunersquois an indicator of peace when associated with lsquoreunificationrsquo but it canalso be a sign of a military threat when it does not appear in conjunc-tion with lsquoreunificationrsquo A careful reading of the KCNA articles thatinclude the term lsquoJunersquo suggests an explanation for this strange phe-nomenon It turns out that the word lsquoJunersquo can refer to either theKorean War (which broke out on 25 June 1950) or the historic summitbetween Kim Dae-jung and Kim Il-sung on 15 June 2000 (which waspraised as a major cornerstone for future lsquoreunificationrsquo in the officialNorth Korean press) As a result when the word lsquoJunersquo appears inclose association with lsquoreunificationrsquo it refers to the 2000 summit (orlsquothe June 15 Summitrsquo as it is called in North Korea) thus indicatingthe peaceful intentions of Pyongyang By contrast when the wordlsquoJunersquo appears alone without connection to lsquoreunificationrsquo it refers tothe Korean War thus conveying a more hostile mood

Third it is also interesting to note that the North Korean govern-ment tends to publish many lsquosignedrsquo commentaries from RodongSinmun in the KCNA immediately before it launches military provoca-tions In these lsquosignedrsquo commentaries of Rodong Sinmun Pyongyangusually criticizes foreign countries (especially the United States Japanand South Korea) in very harsh terms on various issues Accordinglythe third term lsquosignedrsquo seems to correspond to the fact that RodongSinmun the official mouthpiece of the North Korean regime barksvery loudly before it actually bites its intended target A sudden in-crease in the number of lsquosignedrsquo commentaries of Rodong Sinmun inKCNA articles appears to be a reliable sign that the North Koreangovernment may be turning to an attack mode

Finally the last indicator of an imminent threat lsquoassemblyrsquo is theeasiest term to identify but the hardest one to interpret As shown inFigure 5 the term lsquoassemblyrsquo is primarily used in reference to the SPABecause the SPA is the highest North Korean authority with regard toits relations with foreign countries it is tempting to interpret a suddenincrease in references to the SPA prior to military provocations as indi-cating that Pyongyang is attempting to openly signal its hostile

16 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

intentions to the outside world In other words belligerent messagesemanating from the SPA (and reported in KCNA articles) may revealthe determination of the North Korean regime

Although plausible the problem with such an interpretation is thata close reading of several KCNA articles using the term lsquoassemblyrsquoshows that messages from the SPA are often not dire at all For exam-ple consider the following article published on 17 November 2010 justdays before the North Korean shelling at Yonpyong Island lsquoKim YongNam President of the Presidium of the DPRK SPA sent a message ofgreetings to Qaboos Bin Said Sultan of Oman on Wednesday on theoccasion of its national holidayrsquo (KCNA 17 November 2010) As thisexample illustrates a typical KCNA article with the word lsquoassemblyrsquodescribes rather routine business of the SPA such as how it sent a mes-sage of congratulations to a foreign country how it greeted visitingdignitaries from foreign countries and so on Clearly such mundanemessages cannot be a meaningful sign of an imminent North Koreanmilitary provocation While the publication of these articles (containingthe term lsquoassemblyrsquo) might possibly correspond to an uptick inPyongyangrsquos diplomatic efforts to strengthen its ties with foreign coun-tries and increase international support prior to a military attack weneed further analyses to substantiate such a hypothesis At the sametime it also seems clear from the test that the term lsquoassemblyrsquo does not

Figure 5 Social network analysis for key words in KCNA articles

Detecting patterns in North Korean military provocations 17

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

appear randomly and is somehow linked to the timing of NorthKorean military provocations Further research is necessary to make aproper interpretation of this seemingly irrelevant yet apparently signifi-cant indicator of North Korean military threats

31 Robustness checks

Before we conclude it is necessary to address five issues regarding therobustness of our findings (i) the rare-event issue (that is due to therare nature of a North Korean military attack our sampling ratio of710 should be justified) (ii) an out-of-sample alternative (that is in-stead of building our model by using 70 of data from the five NorthKorean military strikes and then applying it to the remaining 30 ofthe data it may be better to develop a model from the first threestrikes and then apply it to the remaining two North Korean provoca-tions) and (iii) a North Korean leadership change effect (that is in-stead of elaborating a single model covering 1997 to 2013 it may makesense to develop two separate models [one for the Kim Jong-il periodand the other for the Kim Jong-un era] in order to check whether thereis a leadership change effect) (iv) a South Korean leadership changeeffect (that is it may be better to elaborate two different models [onefor the lsquoconservativersquo period during Lee Myung-bak and Park Geun-hye and the other for the lsquoprogressiversquo era during Kim Dae-jung andRoh Moo-hyun] in order to investigate whether North Korea re-sponded differently to leadership changes in South Korea and (v) ano-Cheonan alternative (that is it is worth elaborating a model whileexcluding the Cheonan naval ship case because unlike the remainingfour attacks Pyongyang has denied its involvement in sinking theCheonan In this section these five issues are addressed to check therobustness of our model

First there is a discrepancy in the ratio of threat days vs non-threatdays between the data used in our analysis and the actual frequencyAs explained earlier we have defined the threat articles as those pub-lished one week prior to each actual crisis whereas the non-threatitems are defined as those chosen randomly for a 10-day period at leasttwo months before or after actual provocations For all five crises theratio of our data is then 3550 (in days) or 710 In reality however thenumber of peace days (ie 18 years minus 35 days) is much larger than

18 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

the number of crisis days (35 days) When we count all of the peacedays that are not included in our data (ie when we take all TrueNegatives into our data) the North Korean provocations become ex-tremely rare events As a result a possible criticism is that our ap-proach does not represent the true data generating process in a sam-pling procedure While military provocations occur rarely in realityour sampling procedure exaggerates their likelihood by significantly re-ducing the number of peace days in the data Put more technically ourmodel reduces the number of False Positives while increasing the num-ber of False Negatives

Despite the criticism our sampling strategy can be justified for tworeasons First the rare-event problem is not new in political scienceFor example King and Zeng (2001ab) addressed the same problem instatistical analysis (eg logit analysis with rare events) A commonlysuggested solution is the lsquochoice-based or endogenous stratified sam-plingrsquo in econometrics and the lsquocase-control designrsquo in epidemiology(Breslow 1996) The idea is to lsquoselect within categories of Yrsquo such thatall crisis days are sampled while we use a small random fraction ofnon-crisis days In this respect our sampling strategy is consistent withsuch an approach in that we have chosen all crisis days (35) and a ran-dom fraction of peace days (50) Not surprisingly such an approach iscommonly used in supervised machine-learning as well The rare-eventissue also known as a class imbalance problem is an ongoing issue insupervised machine-learning because the size of one class is often muchlarger than that of the other class (eg premature births violent civilconflicts fraudulent credit card transactions etc) In supervisedmachine-learning two solutions have been suggested a data samplingtechnique and an algorithmic modification technique Whereas datasampling is to under-sample the majority class while over-sampling theminority class a modification technique is to combine both over- andunder-sampling methods at the same time In this article we haveadopted a data sampling technique in order to address the class imbal-ance between threat days (a minority class) and non-threat days (a ma-jority class) in North Korean military provocations

Second although reducing the number of non-crisis days in ourdata (under-sampling) while using all the crisis days (over-sampling) isconsistent with both statistical and machine-learning literature therestill remains a question How far should we reduce the number of

Detecting patterns in North Korean military provocations 19

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

peace days Will a ratio of 12 13 or an even more skewed ratio gen-erate a better performance than our model using a 710 ratio To an-swer this question we pushed the ratio until we saw a clear sign of themodel becoming too skewed towards the majority class (peace days)In our case it occurred when the ratio between the two classes ex-ceeded 13 On this subject literature on machine-learning clearlyshows that the ratio of crisis and non-crisis should not be significantlyunbalanced According to Ertekin (2013) lsquo[i]n problems where theprevalence of classes is imbalanced it is necessary to prevent the resul-tant model from being skewed towards the majority (negative) classand to ensure that the model is capable of reflecting the true nature ofthe minority (positive) classrsquo Otherwise the generated model can sufferfrom an over-fitting problem For instance in the face of an extremelyskewed dataset (eg the real ratio of our data would be 35 crisis daysvs 6535 peace days) an automated machine-learning process naturallybecomes lsquogreedyrsquo that is in order to increase its overall accuracy it in-creasingly focuses on the larger category (6535 days) while virtually ig-noring the smaller category (35 days) Generating the model in thisway would be problematic however because our object is to focus onthe minority category of crisis days which is less than 05 of all daysfrom 1997 to 2013 After all we are trying to classify upcoming NorthKorean attacks not peaceful days

With this goal in mind we ran robustness checks to see if alternativemodels yield a better explanatory power when we increased the numberof peace days vis-a-vis threat days Because the maximum limit of anunbalanced ratio for automated machine-learning is 13 we attemptedboth 12 and 13 in our tests As Table 2 shows it is clear that the orig-inal model outperforms both alternatives Compared to the 12 ratiothe original model has more explanatory power in every way (iehigher overall accuracy higher threat accuracy higher non-threat accu-racy) By contrast the 13 ratio model has a slightly lower overall accu-racy marginally higher non-threat accuracy but devastatingly lowerthreat accuracy (175) than our original model because the increasingimbalance (from 710 to 13) turned the automated machine-learningprocess to a lsquogreedyrsquo mode As a result it is our conclusion that theoriginal model uses the golden ratio with the most accurate categoriz-ing capacity

20 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Second it is also necessary to check whether an out-of-sample ap-proach is a better alternative to our original model In our analysis wedeveloped a model by using 70 of data from the five North Koreanattacks and then applied it to the remaining 30 of data from Onemay suspect however that a better approach is to elaborate a modelfrom the first three North Korean strikes (ie during the Kim Jong-ilperiod) and then apply it to the remaining two attacks (ie during theKim Jong-un era) in order to see whether it yields more explanatorypower As Table 2 shows the opposite turns out to be the case In factthe out-of-sampling model loses explanatory power in every possibleway (ie lower overall accuracy lower threat accuracy and lower non-threat accuracy) As a result our original model outperforms the out-of-sampling alternative

Third it is necessary to check whether or not there is a leadershipchange effect in North Korea Has the change of power from KimJong-il to Kim Jong-un produced such significant differences in theirleadership style that it may be worth developing two separate models(one for the Kim Jong-il period the other for the Kim Jong-un era)instead of a single model covering the entire period as we did As

Table 2 Comparison with alternative models

ModelsOverall accuracy (N)

Threat accuracy(Sensitivity)

Non-threataccuracy (Specificity)

Rare event I 724 328 823

12 Ratio (368508) (66201) (302367)

Rare event II 788 175 99

13 Ratio (656832) (36206) (620626)

Out of sampling 541 145 821

KJI KJU (6471195) (72495) (575700)

NK leadership KJI 685 (98143) 459 (2861) 793 (6582)

KJI vs KJU KJU 595 (195328) 612 (93152) 58 (102176)

SK leadership Con 614 (221360) 363 (58160) 815 (163200)

Con vs Prog Prog 681 (98144) 493 (3571) 863 (6373)

No cheonan 669 514 787

Case (273408) 92178 181230

Our Model 823 503 961

(401487) (74147) (327340)

Detecting patterns in North Korean military provocations 21

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Table 2 shows the double platform alternative yields a worse outcomeoverall In fact its only advantage is that the Kim-Jong-un-model hasslightly higher threat accuracy (612) than our original model(503) In every other aspect however the Kim-Jong-un-model per-forms much worse Moreover the Kim-Jong-il-model has a much loweraccuracy in all respects than our original model As a result it is clearthat the double platform is a worse alternative to our single platformThe recent leadership change in Pyongyang has shown little effects asfar as its military provocations are concerned

Fourth it is important to investigate whether a leadership change inSouth Korea has any impact on North Korean military provocationsHas the oscillation of power between the lsquoprogressiversquo administrations(Kim Dae-jung and Roh Moo-hyun) and the lsquoconservativersquo regimes(Lee Myung-bak and Park Geun-hye) invoked different responses fromPyongyang If so we should expect a double platform (one model forthe conservative period vs the other model for the progressive era inSouth Korea) to outperform the original single platform As Table 2shows however the two separate models in the double-platform alter-native perform much worse than the original single platform in all cat-egories an overall accuracy threat accuracy and non-threat accuracyClearly the original model is a better choice It is shown above thatthe leadership change in Pyongyang has not produced significant policyshifts in its military provocations Likewise leadership changes inSouth Korea do not seem to have invoked major policy shifts inPyongyang as far as its military provocations are concerned

Finally it is worth examining an alternative model that excludes thesinking of the Cheonan naval ship case Unlike the remaining four at-tacks in our dataset the North Korean government has consistentlydenied its involvement in the Cheonan incident If Pyongyang had notreally sunk the Cheonan as it claimed it means that our original modelis performing below its full potential because it is built upon some er-roneous cases (ie the Cheonan incident) In that case we should ex-pect the no-Cheonan alternative to outperform our original model AsTable 2 shows however when we exclude all the data related to theCheonan case the resulting model loses much explanatory power com-pared to the original model Only in one category (a threat accuracy)the no-Cheonan-case alternative outperforms our original model buteven in that category the difference is ignorable (503 vs 514) In

22 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

all other categories the original model shows much stronger perfor-mances 823 vs 669 in an overall accuracy and 961 vs 787 ina non-threat accuracy Although Pyongyang has denied its involvementin the Cheonan incident the huge performance gap between the origi-nal model and the no-Cheonan alternative suggests otherwise

A different way of comparing performances of alternative models isshown in Figure 6 where the Receiver Operating Characteristic (ROC)is presented A ROC space is defined by a False Positive Rate (1 ndashSpecificity) in the x-axis and a True Positive Rate (Sensitivity) in the y-axis In a ROC space a theoretically perfect model (ie a combinationof 100 True Positive Rate with 0 False Positive Rate) appears in theupper left corner with the coordinates (0 1) In comparison a purelyrandom model such as a coin toss is found along a 45-degree diagonalline As a result good models (ie those performing better than ran-dom guesses) are found above the 45 degree line (especially close to they-axis) whereas bad models are located below it In Figure 6 the

Figure 6 Receiver operating characteristic (ROC) plot

Detecting patterns in North Korean military provocations 23

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

coordinates of our model (004 05) means that it has a False PositiveRate of 4 while its True Positive Rate is 50 By contrast the coordi-nates of alternative models in the ROC space are as follows (i) theRare Event I model with 12 ratio is (018 033) (ii) the Rare Event IImodel with 13 ratio is (001 018) (iii) the Out of Sampling (KJI KJU) model is (018 015) (iv) the Leadership Change (KJI-only)model is (021 046) (v) the Leadership Change (KJU-only) model is(042 061) (vi) the model for Conservative South Korean leadershipis (018 036) (vii) the model for Conservative South Korean leader-ship is (014 049) and (viii) the model that excludes the Cheonan caseis (021 051) As one can see all eight alternatives yield their coordi-nates either close to or lower than the diagonal line in the ROC spacewhereas our model lies above the diagonal line and close to the y-axisin the ROC space As a result we can conclude that the original modelperforms better than its potential rivals

4 Conclusion

In this article we studied articles published by the KCNA the officialnews outlet of North Korea in order to analyze patterns of its conven-tional military provocations To this end we have adopted a newmethod of automated text classification through supervised machinelearning Our model investigated the frequency of all terms appearingin KCNA articles immediately prior to five North Korean military at-tacks between 1997 and 2013 The frequency of these terms was thencompared with the frequency of terms appearing in KCNA articlespublished during peacetime without military provocations The com-parison brought to light a number of key terms ndash lsquoattack wordsrsquo so tospeak ndash whose appearances spiked in the KCNA prior to NorthKorean attacks Based on these terms our model correctly identifieseight of 10 articles as signs of imminent attacks or as peacetime newspieces

Specifically our model found five pattern-detecting terms of NorthKorean military threats lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquo lsquoJunersquo andlsquoJapanesersquo For a proper analysis of their meaning we went back to thearticles in which these five attack words appeared in order to investi-gate the contexts in which they were used by the North Korean gov-ernment Our investigation shows that in the lead-up to an attack

24 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Pyongyang displayed a strong tendency to emphasize the legacy of itsmilitary struggle against lsquoJapanesersquo colonialism and its fight against theUS imperialism during the Korean War (which began in lsquoJunersquo 1950)perhaps in an effort to heighten a domestic patriotic fervor at a timeof impending crisis In addition immediately before Pyongyang em-barks on hostile provocations the KCNA tends to increase its reprint-ing of lsquosignedrsquo commentaries from Rodong Sinmun which typically crit-icizes the United States South Korea and Japan in harsh termsprobably reflecting its deteriorating relations with the outside world

It seems that our methodology is promising to studies of interna-tional conflicts of various sorts First as shown in this article amachine-learning technology is useful in terms of detecting patternsthat distinguish threats of Pyongyang from its lsquonoisesrsquo or lsquobluffingrsquoSecond for other authoritarian regimes where there is a tight govern-ment controls over the mass media our approach can be used to detectpatterns symptoms or clues to escalating crisis Finally even for dem-ocratic countries with a free media our approach can be used on vari-ous occasions For instance a machine-learning technology can be uti-lized to analyze certain aspects of domestic politics such as signalingor communication between policymakers and the public To cite a con-crete example we can use a machine-learning technique to examinehow an aggressive foreign policy like lsquosabre rattlingrsquo affects public per-ceptions of fear or crisis in democracy Also a machine-learning ap-proach can be used to analyze external interactions of a democratic re-gime with other countries For instance we can test if there are anysignificant correlations between presidential elections in the US and thelevel of North Korean threats or between North Korean nuclear testsand varying public reactions in South Korea

As a final note it is not our contention that the model developed inthis article based on an automated machine-learning technique has de-tected intentional signals which Pyongyang sends to the outside worldimmediately prior to its military attack Instead the key attack wordswe have identified should be seen as signs or patterns that the NorthKorean regime unwittingly displays when it is inching toward a militaryoption For two reasons we doubt an intentional or signaling natureof the attack words in our model First if Pyongyang has indeed usedthe five attack words as a deliberate signal to the outside world that itis gearing up for military provocations why would it send the signal in

Detecting patterns in North Korean military provocations 25

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

such an arcane way that can be detected only through a complicatedautomated text classification process If the North Korean regime in-tends to send a signal to the outside world there is a better clearerand more visible option such as an Official Announcement by theMinistry of Foreign Affairs which Pyongyang has occasionally pub-lished in the pages of the KCNA on important issues Second if it hadbeen sending intentional signals of an impending military strike whywould the North Korean regime deny such an attack afterwards If re-peated an ex post denial would only reduce the credibility of an exante signal creating an image of North Korea as a casual bluffer Fortheir arcane nature and occasional ex post denial the five attack wordsin our model should not be treated as a premeditated signal fromPyongyang Instead they should be understood as inadvertent signs orpatterns unconsciously displayed by the North Korean governmentprior to a military attack In fact it is the unplanned nature of theseattack words that provides even more valuable information to the out-side world because it eliminates the possibility of feigned or false sig-nals from North Korea Like a pitcher who unknowingly flinches be-fore he throws a fast ball Pyongyang may unwittingly display certainpatterns before it launches a military strike

Acknowledgements

The publication of this article is supported by the National ResearchFoundation of Korea Grant funded by the Korean Government (NRF-2014S1A5A2A03065042 and NRF-2013S1A3A2055081) Previous ver-sions of this article were presented at the 2014 International StudiesAssociation Annual Convention (Toronto Canada) For questions re-garding the article please contact H Joo

References

Breslow NE (1196) lsquoStatistics in epidemiology the case-control studyrsquoJournal of American Statistical Association 91 433 14ndash28

Cha V (2010) The Sinking of the Cheonan Center for Strategic ampInternational Studies (2010 April 22) httpcsisorgpublicationsinking-cheonan (24 July 2014 date last accessed)

26 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Choe S-H (2009) Korean Navies Skirmish in DisputedWaters New YorkTimes (2009 November 10) httpwwwnytimescom20091111worldasia11koreahtml_rfrac140 (23 June 2014 date last accessed)

Ertekin S (2013) lsquoAdaptive oversampling for imbalanced data classificationrsquoin G Erol amp L Ricard (ed) Information Sciences and Systems pp261ndash269 New York Springer

Global Security (2002) The naval clash on the yellow sea on 29 June 2002 be-tween South and North Korea the situation and ROKrsquos position (2002July 1) Global Security httpwwwglobalsecurityorgwmdlibrarynewsdprk2002dprk-020701-1htm (22 July 2014 date last accessed)

Hong S (2012) lsquoNorth Korearsquos capability to conduct provocations and ROK-US capability to counter themrsquo Korea Association of Defense IndustryStudies 19 2 135ndash136

Hopkins D and King G (2010) lsquoExtracting systemic social science meaningfrom textrsquo American Journal of Political Science 54 1 229ndash47

Jung S-C (2013) Internal Instability and External Provocation Seoul KINU

Jung S-Y (2008) lsquoA study on the origin of the pueblo incident on 1968rsquoKukjejongchiyonrsquogu 11 2 179ndash207

Jurafsky D and Martin J (2009) Speech and Natural Language ProcessingAn Introduction to Natural Language Processing Computational Linguisticsand Speech Recognition NJ Prentice Hall

Kang C-G (2013) lsquoA laboratory for recursive partyrsquo Hangukcontentshakhoe11 4 23ndash32

King G and Zeng L (2001a) lsquoLogistic regression in rare events datarsquoPolitical Analysis 9 2 137ndash163

King G and Zeng L (2001b) lsquoExplaining rare events in InternationalRelationsrsquo International Organization 55 3 693ndash715

Ko M-K (2015) lsquoNorth Korean military adventurism in the late 1960s andthe changes of party-military relationsrsquo Hyondaibukhanyongu 18 3 7ndash58

Lee W-K (2014) lsquoA case study on the provocations by NKrsquo Kunsaji 91 663ndash110

Litwak R (2007) Regime Change Washington DC Johns HopkinsUniversity Press

Macfie N (2013) The Battles of the Korean West Sea (2010 November 29)Reutershttpwwwreuterscomarticle20101129us-korea-north-clashes-idUSTRE6AS1AL20101129 (5 August 2014 date last accessed)

Mearsheimer J J (2001) The Tragedy of Great Power Politics New YorkWW Norton amp Company

Moore M and Hutchison P (2010) Yeonpyeong Island A History(2010November 23) The Telegraph httpwwwtelegraphcouknewsworldnews

Detecting patterns in North Korean military provocations 27

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

asiasouthkorea8155486Yeonpyeong-Island-A-historyhtml (7 August 2014date last accessed)

Oh I-W (2011) lsquoSouth Korearsquos countermeasures against North KorearsquosArmed Provocation and Offensive dialogue proposalrsquo Trsquoongiljyollyak 111 227ndash266

Rich T (2012a) lsquoLike father like son Correlates of leaderhip in NorthKorearsquos english language newsrsquo Korea Observer 43 4 649ndash674

Rich T (2012b) lsquoDeciphering North Korearsquos nuclear rhetoric an automatedcontent analysis of KCNA newsrsquo Asian Affairs An American Review 3973ndash89

Sigal L (1998) Disarming Strangers Nuclear Diplomacy with North KoreaPrinceton Princeton University Press

Sohn J-Y (2002) South North Korea clash at sea CNN (2002 June 29)httpeditioncnncom2002WORLDasiapcfeast0629koreawarships (21July 2014 date last accessed)

Sudworth J (2010) How South Korean Ship was Sunk BBC News (2010May 20) httpwwwbbccouknews10130909 (4 August 2014 date lastaccessed)

28 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

  • lcw016-FN1
  • lcw016-FN2
  • lcw016-FN3
  • lcw016-FN4
Page 6: Detecting patterns in North Korean military provocations ...idse.or.kr/file/Whang_1.pdf · that North Korea may have because of increasing power asymmetry since the collapse of the

Korea Below is a brief description of the five North Korean militaryprovocations

121 The first battle of Yonpyong15June1999

In 1999 Pyongyang claimed that South Korea trespassed the NorthernLimit Line (NLL) in the Yellow Sea On 7 June 1999 when NorthKorean patrol ships and fishing boats crossed the NLL the SouthKorean navy responded by increasing its patrol After a few days ofcontinual trespassing the South Korean government issued a warningand deployed two patrol corvettes (Hong 2012) When Pyongyangcontinued to ignore warnings the South Korean navy blocked NorthKorean boats ramming them (Macfie 2013) After a few skirmishesthe main battle occurred on 15 June 1999 when four North Korean pa-trol ships trespassed across the NLL soon joined by three NorthKorean torpedo boats and three additional patrol ships With a totalof ten battleships North Koreans launched a 25 mm cannon shell

Figure 1 Five conventional North Korean military attacks

Pyongyang as far as its conventional provocations are concerned By contrast a doubleplatform outperforms a single platform in the case of the North Korean nuclear crisis indi-cating a major policy shift from Kim Jong-Il to Kim Jong-Un At the moment we are try-ing to find out whether such a shift indicates an increasing intention of the new NorthKorean leadership under Kim Jong-Un to proclaim a lsquonuclear powerrsquo status by developingits nuclear program further instead of negotiating over it as his father had done on severaloccasions

6 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

(Park 2009) In response the South Korean navy fired with 40 mmand 76 mm machine guns When the battle was over a North Koreantorpedo boat sank a large patrol ship crashed and four patrol shipssustained damage In the process approximately 30 North Korean sol-diers were killed and 70 were wounded As for South Korea four pa-trol killers and one patrol corvette were damaged with nine soldierswounded (Moore and Hutchison 2010)

122 The second battle of Yonpyong 29 June 2002

By 2002 North Korean ships frequently crossed the NLL only to bechased back by South Korean patrol vessels On 29 June 2002 twoNorth Korean patrol ships crossed the border ignoring warnings fromSouth Korean navy speedboats At 1025 am the two North Koreanpatrol boats attacked nearby South Korean vessels with its 85 mmgun 25 mm auxiliary gun and hand carried rockets In response theSouth Korean patrol ships returned fire (Sohn 2002) A 20-minute bat-tle resulted in the death of four South Korean marines one missingand 18 wounded On the North Korean side approximately 30 sailorswere killed or injured While South Korean vessels were partly dam-aged one of the North Korean vessels was towed away in flames(Global Security 2002)

123 The battle of daechong 10 November 2009

On 10 November 2009 a North Korean patrol vessel trespassed theNLL Soon after two 130-ton South Korean vessels issued warningsbut the North Korean vessel ignored them When the South Koreanships fired warning shots the North Korean vessel began firing leadingto a 2-minute battle near Daechong Island located 18 miles off theNorth Korean coast The North Korean patrol vessel also attacked aSouth Korean high-speed patrol vessel (Kim 2009) In response theSouth Korean vessel countered with approximately 200 shots Whenthe battle was over there were no South Korean casualties but NorthKorea suffered one casualty and three injuries with its naval vessellsquowrapped in flamesrsquo (Choe 2009)

Detecting patterns in North Korean military provocations 7

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

124 Sinking of the cheonan naval ship 26 March 2010

On 26 March 2010 the South Korean naval ship Cheonan sank intothe Yellow Sea At 922 pm an explosion occurred in the 1200-tonwarship that was sailing by Baengnyong Island where the two Koreashad clashed numerous times before (Cha 2010) In the end 46 liveswere lost and it was quickly suspected that the ship lsquohad been hit byan external lsquonon-contactrsquo explosionrsquo with North Korea as the primesuspect (Sudworth 2010) Pyongyang denied its involvement andclaimed that the incident had been contrived by South Korea A six-week-long investigation by international experts proved the involve-ment of North Korea in the attack

125 Shelling of Yonpyong island 23 November 2010

On 23 November 2010 North Korea fired artillery shells at YonpyongIsland near the NLL About 200 shells hit the island and set fire todozens of buildings The barrage killed two South Korean citizens andtwo marines while injuring three civilians and 17 soldiers The attackbegan when South Korea was practicing military drills near the NLLdespite North Korean warnings North Korea fired three separate bar-rages with dozens of artillery shells in each barrage In return SouthKorea responded by firing 80 rounds from K9 155 mm self-propelledHowitzers (Kim and Kim 2011) When the battle was over more than50 civilian homes were in flames

13 Research method supervised machine learning

Although the North Korean nuclear crisis has been the focus of the in-ternational community in recent years the history of North Koreanmilitary provocations dates much further back For instancePyongyang destabilized an already precarious security environment inthe Korean Peninsula by launching a series of provocations such asthe hijacking of a South Korean airline (1958) the Korean DMZConflict known as lsquothe Second Korean Warrsquo with more than 700 casu-alties (1969) the hijacking of the USS Pueblo (1968) the notoriousAxe Murder Incident (1976) the bombing of the Korean Air Line 858(1987) and so on

8 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Not surprisingly many scholars have attempted to analyze NorthKorean military provocations trying to identify their lsquocausesrsquo (Jung2013 Lee 2014 Ko 2015) main lsquogoalsrsquo in those attacks (Jung 2008Kang 2013) proper lsquopoliciesrsquo to curtail further threats (Oh 2011Kim 2012) and so on Despite sincere efforts earlier studies sufferedfrom one notorious problem a lack of reliable data As one put itNorth Korea has been lsquothe blackest of black holesrsquo (Litwak 2007289) Under such circumstances scholars had little choice but to makeeducated guesses ndash lsquoguesstimationsrsquo ndash regarding unknown intentionsgoals or likely moves of the North Korean regime As a result the ex-isting literature on North Korean military provocations has beendriven less by reliable data but more by subjective interpretations

By contrast we rely on the official North Korean media KCNA Asfor the method we use a supervised machine-learning technology tomaximize the use of the KCNA dataset that covers various aspects ofNorth Korea from 1997 to 2013 The main advantage of our methodcomes from the quality of the data that is our findings are data-driven and thus more objective than previous works relying on subjec-tive interpretation of selected observations Given the paucity ofreliable information on North Korea it is important to analyze thecontent of official KCNA articles that are publicly available A super-vised machine-learning technology is the optimal method to processsuch data objectively

Supervised machine-learning consists of three steps(i) data collec-tion and document labelling (ii) pre-processing and (iii) model extrac-tion and analysis2 First we obtained our dataset from the KCNAwebsite (httpwwwkcnacojp) Since there are five North Korean mili-tary attacks in our case we pulled 10 tranches of articles from all theKCNA articles published from 1997 to 2013 We then labelled five ofthese tranches lsquothreatrsquo while labelling the other five as lsquonon-threatrsquo All

2 We tried a variety of machine learning algorithms such as Random Forests SupportVector Machines and Conditional Inference Trees Also we cross-validated their resultsOur findings demonstrate that these algorithms produced similar results Moreover all ofthese algorithms identified the similar pattern-detecting terms for classifying the KCNA ar-ticles In this article we reported the results based on the Conditional Inference Tree algo-rithm which ran the tree-structured regression models through binary recursivepartitioning in a conditional inference framework For more details on the ConditionalInference Tree algorithm please see the R Package lsquoPartyrsquo for Conditional Inference Trees(lsquoctreersquo) at httpcranr-projectorgwebpackagespartypartypdf

Detecting patterns in North Korean military provocations 9

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

the articles published within a week preceding each attack is labelled aslsquothreatrsquo and the rest of KCNA articles were treated as lsquonon-threatrsquo Intotal there were 1624 KCNA articles published during this periodwith 487 labelled as lsquothreatrsquo and 1137 labelled as lsquonon-threatrsquoFor the threat articles we extracted all the articles published within aweek of each North Korean military provocation (Fig 2) For the non-threat articles we selected tranches of news articles for a randomlychosen 10-day period from at least two months before or after a NorthKorean attack In so doing our goal was to capture articles that wereunrelated to a North Korean attack thus labelled as lsquonon-threatrsquo

Second the selected dataset is then lsquopreprocessedrsquo Since the originalarticles are not ready for an automated text analysis a cleaning processcalled lsquopreprocessingrsquo is necessary to prepare them for machine-learning Pre-processing is the standard procedure before the applica-tion of an automated text analysis Following the standard preprocess-ing procedure we remove all numbers and stopwords (eg a the into etc)3 We then transformed the body of the text into a documentterm matrix that was composed of rows of news articles followed bycolumns of terms The document term matrix turned preprocessed texts

Non-Threat

-2 +2

Acl

All Threacollected eading up

All Nowere ptwo mo

Threat

Provocation

at-labeledfrom the p to a pro

on-Threatpulled twoonths afte

d articles wseven day

ovocation

t-labeled o months er a provo

ere ays

articles before or

ocation

Non-

r

Threat

Time

Figure 2 Labelling threat and non-threat articles

3 In this article we use one of the most popular preprocessing methods known as the lsquobag ofwordsrsquo concept (Jurafsky and Martin 2009) It discards the use of word order as a factorand removes the so-called lsquostopwordsrsquo from the data The stopwords include punctuationcapitalization common words (eg at an the etc) and numbers For a full list of thestopwords used in our research please visit httpjmlrcsailmitedupapersvolume5lewis04aa11-smart-stop-listenglishstop Although preprocessing reduces the amount of infor-mation it has been shown by researchers that a simplification of text via preprocessing issufficient to infer valuable models (Hopkins and King 2010)

10 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

into quantifiable values (eg term counts and frequencies) for each cor-responding article

Finally once we had collected and preprocessed the KCNA data wesplit the whole data into two subsets a training dataset and a testdataset From the complete set of articles from both threat and non-threat tranches (1624 articles in total) we randomly selected 70(1137 articles) for the training dataset (Fig 3) The labels lsquothreatrsquo orlsquonon-threatrsquo for each article were included in the training dataset forthe purpose of automated machine-learning that is to develop amodel that can select pattern-detecting features based on a priori clas-sifications Basically the key pattern was frequency of occurrence Thefrequency with which certain words and short phrases appeared inthreat or non-threat articles determined how they were selected andweighted as pattern-detecting features in the model The trainedmodel was then applied to the remaining test dataset which was usedsolely for testing the accuracy of our model Importantly we removedall the threat and non-threat labels from each article in the testdataset The purpose for doing so was to challenge our model to usewhat it had learned (from the training dataset) about significant fea-tures (ie a term frequency rate) to accurately classify KCNA articles(in the test dataset) as either a threat or a non-threat without relyingon labels4

All Data 1624 Articles

1137 Articles 487

Training DataThreat Labels Provided

Test DataThreat Labels Removed

Figure 3 Training and test data sets

4 For replication our data is available at httpsdataverseharvardedudatasetxhtmlpersistentIdfrac14doi3A1079102FDVN2FB8CWWD

Detecting patterns in North Korean military provocations 11

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

2 Results

21 Training data results

Figure 4 shows the model we obtained from the training dataset usinga supervised machine-learning The key terms distinguishing NorthKorean threats from a peaceful period were lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquolsquoJunersquo and lsquoJapanesersquo When certain conditions were satisfied (as ex-plained below) the appearance of these terms in KCNA articles couldplay the role of a pattern-indicator that a North Korean military threatwas on the horizon

According to our model the first and strongest indicator of a NorthKorean military provocation is the appearance of the term lsquoyearsrsquo Inthe training dataset all KCNA articles in which the word lsquoyearsrsquo ap-peared at least once (53 articles in total) were published within a weekof a North Korean military strike From our viewpoint the most reli-able sign of an imminent North Korean military provocation is a sud-den increase in the frequency of the term lsquoyearsrsquo in KCNA articles

When the term lsquoyearsrsquo is absent the second best indicator of a NorthKorean military attack is the word lsquosignedrsquo Approximately 85 ofKCNA articles (51 articles in total) that had the term lsquosignedrsquo withoutthe word lsquobakrsquo (from former South Korean president Lee Myung-bak)

Figure 4 First conditional inference tree

12 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

were published within a week of a North Korea military provocationAs a result a sudden spike in the use of the word lsquosignedrsquo (withoutlsquobakrsquo) indicates that a North Korean military provocation is around thecorner In this respect however there is a further condition that mustbe satisfied When lsquosignedrsquo and lsquobakrsquo appear together in the same arti-cle they indicate a pattern of a peaceful situation instead As a resultthe pattern-detecting role of the term lsquosignedrsquo appears to be contingentupon the presence or absence of another term lsquobakrsquo

According to our model the third index indicating that Pyongyangmay be preparing for a military operation is the term lsquoassemblyrsquo Infact all the KCNA articles that included the word lsquoassemblyrsquo withoutlsquoyearsrsquo or lsquosignedrsquo (22 articles in the training dataset) were publishedwithin one week of a North Korean military attack As a result a sud-den increase in the frequency of the word lsquoassemblyrsquo may be a goodpattern indicator that a North Korean military strike is on the horizoneven in the absence of stronger threat terms such as lsquoyearsrsquo andlsquosignedrsquo

The fourth indicator of a North Korean military threat is the termlsquoJunersquo Unlike other indicators however the role of lsquoJunersquo is condi-tional In the absence of other stronger signs (ie years signed and as-sembly) the term lsquoJunersquo can play the role of a significant indicator of aNorth Korean military attack only if another term lsquoreunificationrsquo ap-pears once or less in the same article To our surprise however theterm lsquoJunersquo turns into a strong indicator of a non-threat situation if theword lsquoreunificationrsquo appears more than twice in the same article As aresult it seems that the word lsquoJunersquo implies an increasing NorthKorean threat only if it is not closely associated with another key pat-tern indicator lsquoreunificationrsquo

The last index of a North Korean military provocation is the wordlsquoJapanesersquo When other (and stronger) terms such as lsquoyearsrsquo lsquosignedrsquolsquoassemblyrsquo and lsquoJunersquo were not present all the KCNA articles that in-cluded the term lsquoJapanesersquo (17 articles in the training dataset) appearedwithin one week of a North Korean military provocation As a resultthe use of the term lsquoJapanesersquo serves as a good pattern indication thatin the absence of other attack words Pyongyang is moving dangerouslyclose to a military strike

Detecting patterns in North Korean military provocations 13

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

22 Testing the model

There are 904 KCNA articles that do not include any of the indicatorsdiscussed above Because they do not contain any pattern indicator ofa North Korean attack our model identifies them as non-threats Inreality however about 20 of these articles were published within oneweek of the five military provocations by Pyongyang As a result ourmodel has roughly 80 accuracy in identifying North Korean militarythreats What is encouraging is that our model has identified five keypattern indicators of increasing North Korean threats When these keyterms are put together under certain conditions they correctly identify80 of articles in the training dataset as threat articles or non-threatitems Put differently the model can accurately classify in 8 of 10 caseswhether various messages from Pyongyang are real threats or justrhetoric

Although 80 accuracy is impressive it is too early to be optimisticAfter all 80 accuracy was achieved with the training dataset fromwhich our model was originally derived If we were impressed by its80 accuracy rate we would resemble a case study specialist who de-veloped an elaborate theory from a few cases and then applied it backto those original cases only to be impressed by how accurate hishertheory was The real test consists of applying our model to cases otherthan those from which it is derived This is the reason why we dividedthe entire KCNA data into two subsets the training dataset fromwhich our model is developed and the test dataset to which it will beapplied as a real test

Table 1 shows the result of our model against the test datasetNumbers in the cells indicate whether there was agreement or discrep-ancy between model predictions and actual cases For example our

Table 1 Model accuracy against test dataset

ActualModel Threat Non-threat

Threat 74 13 Positive predictive value 085

Non-threat 73 327 Negative predictive value 082

Sensitivity Specificity Overall accuracy

050 096 082

14 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

model classified 327 cases (the lower right cell) as non-threat and thosearticles were in fact published during non-threat weeks In additionthe model classified 74 articles (the upper left cell) as threats and thosearticles were actually published during threat weeks When we applyour model to the test dataset its overall accuracy turns out 823roughly the same level we obtained with the training dataset Althoughour model is deduced from the training dataset it does not lose cate-gorizing capacity when applied to the test dataset Specifically of 487articles in the test dataset our model correctly classified 401 articles(823) as either threats or non-threats The model is very effective inidentifying harmless rhetoric from Pyongyang that is messages not re-sulting in actual military attacks When applied to a total of 340 non-threat articles in the test dataset our model successfully classified 327items as noise with only 13 misses (ie 967 accuracy) As a result itis very effective at filtering noise from Pyongyang By contrast thepattern-detecting accuracy of our model drops somewhat with respectto threats Of 147 threat articles in the test dataset it correctly identi-fies 74 items as threats while misrecognizing 73 threats as peaceful arti-cles (ie 50 accuracy)

3 Discussion

What does the model tell us in plain terms In particular what are themeanings of the key indicators for a North Korean military threat (ieyears signed assembly June and Japanese) To answer it is necessaryto go back to the original KCNA documents in order to understandthe contexts in which these terms were used A close reading of theKCNA articles that included the five key indicators or lsquoattack wordsrsquoreveals several interesting patterns

First the North Korean regime tends to emphasize its history ofmilitary struggle against foreign enemies before it launches armed prov-ocations It is then understandable that key terms such as lsquoyearsrsquo andlsquoJapanesersquo often appear in KCNA articles immediately before a militarystrike Prior to the second naval clash of Yonpyong on 29 June 2002for instance the North Korean government published a series of arti-cles in which it boasted of the lsquoyearsrsquo of North Korearsquos victoriousstruggles against foreign enemies including lsquoJapanesersquo colonialism(1910ndash45) It is possible that Pyongyang invoked the military legacy of

Detecting patterns in North Korean military provocations 15

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

its supreme leader Kim Il-sung in anticipation of an impending con-flict either to boost the morale of its people or to show the outsideworld that it was determined not to back down

Second the most perplexing term in Figure 4 is lsquoJunersquo because itcan be indicative of both threat and peace Specifically the word lsquoJunersquois an indicator of peace when associated with lsquoreunificationrsquo but it canalso be a sign of a military threat when it does not appear in conjunc-tion with lsquoreunificationrsquo A careful reading of the KCNA articles thatinclude the term lsquoJunersquo suggests an explanation for this strange phe-nomenon It turns out that the word lsquoJunersquo can refer to either theKorean War (which broke out on 25 June 1950) or the historic summitbetween Kim Dae-jung and Kim Il-sung on 15 June 2000 (which waspraised as a major cornerstone for future lsquoreunificationrsquo in the officialNorth Korean press) As a result when the word lsquoJunersquo appears inclose association with lsquoreunificationrsquo it refers to the 2000 summit (orlsquothe June 15 Summitrsquo as it is called in North Korea) thus indicatingthe peaceful intentions of Pyongyang By contrast when the wordlsquoJunersquo appears alone without connection to lsquoreunificationrsquo it refers tothe Korean War thus conveying a more hostile mood

Third it is also interesting to note that the North Korean govern-ment tends to publish many lsquosignedrsquo commentaries from RodongSinmun in the KCNA immediately before it launches military provoca-tions In these lsquosignedrsquo commentaries of Rodong Sinmun Pyongyangusually criticizes foreign countries (especially the United States Japanand South Korea) in very harsh terms on various issues Accordinglythe third term lsquosignedrsquo seems to correspond to the fact that RodongSinmun the official mouthpiece of the North Korean regime barksvery loudly before it actually bites its intended target A sudden in-crease in the number of lsquosignedrsquo commentaries of Rodong Sinmun inKCNA articles appears to be a reliable sign that the North Koreangovernment may be turning to an attack mode

Finally the last indicator of an imminent threat lsquoassemblyrsquo is theeasiest term to identify but the hardest one to interpret As shown inFigure 5 the term lsquoassemblyrsquo is primarily used in reference to the SPABecause the SPA is the highest North Korean authority with regard toits relations with foreign countries it is tempting to interpret a suddenincrease in references to the SPA prior to military provocations as indi-cating that Pyongyang is attempting to openly signal its hostile

16 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

intentions to the outside world In other words belligerent messagesemanating from the SPA (and reported in KCNA articles) may revealthe determination of the North Korean regime

Although plausible the problem with such an interpretation is thata close reading of several KCNA articles using the term lsquoassemblyrsquoshows that messages from the SPA are often not dire at all For exam-ple consider the following article published on 17 November 2010 justdays before the North Korean shelling at Yonpyong Island lsquoKim YongNam President of the Presidium of the DPRK SPA sent a message ofgreetings to Qaboos Bin Said Sultan of Oman on Wednesday on theoccasion of its national holidayrsquo (KCNA 17 November 2010) As thisexample illustrates a typical KCNA article with the word lsquoassemblyrsquodescribes rather routine business of the SPA such as how it sent a mes-sage of congratulations to a foreign country how it greeted visitingdignitaries from foreign countries and so on Clearly such mundanemessages cannot be a meaningful sign of an imminent North Koreanmilitary provocation While the publication of these articles (containingthe term lsquoassemblyrsquo) might possibly correspond to an uptick inPyongyangrsquos diplomatic efforts to strengthen its ties with foreign coun-tries and increase international support prior to a military attack weneed further analyses to substantiate such a hypothesis At the sametime it also seems clear from the test that the term lsquoassemblyrsquo does not

Figure 5 Social network analysis for key words in KCNA articles

Detecting patterns in North Korean military provocations 17

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

appear randomly and is somehow linked to the timing of NorthKorean military provocations Further research is necessary to make aproper interpretation of this seemingly irrelevant yet apparently signifi-cant indicator of North Korean military threats

31 Robustness checks

Before we conclude it is necessary to address five issues regarding therobustness of our findings (i) the rare-event issue (that is due to therare nature of a North Korean military attack our sampling ratio of710 should be justified) (ii) an out-of-sample alternative (that is in-stead of building our model by using 70 of data from the five NorthKorean military strikes and then applying it to the remaining 30 ofthe data it may be better to develop a model from the first threestrikes and then apply it to the remaining two North Korean provoca-tions) and (iii) a North Korean leadership change effect (that is in-stead of elaborating a single model covering 1997 to 2013 it may makesense to develop two separate models [one for the Kim Jong-il periodand the other for the Kim Jong-un era] in order to check whether thereis a leadership change effect) (iv) a South Korean leadership changeeffect (that is it may be better to elaborate two different models [onefor the lsquoconservativersquo period during Lee Myung-bak and Park Geun-hye and the other for the lsquoprogressiversquo era during Kim Dae-jung andRoh Moo-hyun] in order to investigate whether North Korea re-sponded differently to leadership changes in South Korea and (v) ano-Cheonan alternative (that is it is worth elaborating a model whileexcluding the Cheonan naval ship case because unlike the remainingfour attacks Pyongyang has denied its involvement in sinking theCheonan In this section these five issues are addressed to check therobustness of our model

First there is a discrepancy in the ratio of threat days vs non-threatdays between the data used in our analysis and the actual frequencyAs explained earlier we have defined the threat articles as those pub-lished one week prior to each actual crisis whereas the non-threatitems are defined as those chosen randomly for a 10-day period at leasttwo months before or after actual provocations For all five crises theratio of our data is then 3550 (in days) or 710 In reality however thenumber of peace days (ie 18 years minus 35 days) is much larger than

18 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

the number of crisis days (35 days) When we count all of the peacedays that are not included in our data (ie when we take all TrueNegatives into our data) the North Korean provocations become ex-tremely rare events As a result a possible criticism is that our ap-proach does not represent the true data generating process in a sam-pling procedure While military provocations occur rarely in realityour sampling procedure exaggerates their likelihood by significantly re-ducing the number of peace days in the data Put more technically ourmodel reduces the number of False Positives while increasing the num-ber of False Negatives

Despite the criticism our sampling strategy can be justified for tworeasons First the rare-event problem is not new in political scienceFor example King and Zeng (2001ab) addressed the same problem instatistical analysis (eg logit analysis with rare events) A commonlysuggested solution is the lsquochoice-based or endogenous stratified sam-plingrsquo in econometrics and the lsquocase-control designrsquo in epidemiology(Breslow 1996) The idea is to lsquoselect within categories of Yrsquo such thatall crisis days are sampled while we use a small random fraction ofnon-crisis days In this respect our sampling strategy is consistent withsuch an approach in that we have chosen all crisis days (35) and a ran-dom fraction of peace days (50) Not surprisingly such an approach iscommonly used in supervised machine-learning as well The rare-eventissue also known as a class imbalance problem is an ongoing issue insupervised machine-learning because the size of one class is often muchlarger than that of the other class (eg premature births violent civilconflicts fraudulent credit card transactions etc) In supervisedmachine-learning two solutions have been suggested a data samplingtechnique and an algorithmic modification technique Whereas datasampling is to under-sample the majority class while over-sampling theminority class a modification technique is to combine both over- andunder-sampling methods at the same time In this article we haveadopted a data sampling technique in order to address the class imbal-ance between threat days (a minority class) and non-threat days (a ma-jority class) in North Korean military provocations

Second although reducing the number of non-crisis days in ourdata (under-sampling) while using all the crisis days (over-sampling) isconsistent with both statistical and machine-learning literature therestill remains a question How far should we reduce the number of

Detecting patterns in North Korean military provocations 19

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

peace days Will a ratio of 12 13 or an even more skewed ratio gen-erate a better performance than our model using a 710 ratio To an-swer this question we pushed the ratio until we saw a clear sign of themodel becoming too skewed towards the majority class (peace days)In our case it occurred when the ratio between the two classes ex-ceeded 13 On this subject literature on machine-learning clearlyshows that the ratio of crisis and non-crisis should not be significantlyunbalanced According to Ertekin (2013) lsquo[i]n problems where theprevalence of classes is imbalanced it is necessary to prevent the resul-tant model from being skewed towards the majority (negative) classand to ensure that the model is capable of reflecting the true nature ofthe minority (positive) classrsquo Otherwise the generated model can sufferfrom an over-fitting problem For instance in the face of an extremelyskewed dataset (eg the real ratio of our data would be 35 crisis daysvs 6535 peace days) an automated machine-learning process naturallybecomes lsquogreedyrsquo that is in order to increase its overall accuracy it in-creasingly focuses on the larger category (6535 days) while virtually ig-noring the smaller category (35 days) Generating the model in thisway would be problematic however because our object is to focus onthe minority category of crisis days which is less than 05 of all daysfrom 1997 to 2013 After all we are trying to classify upcoming NorthKorean attacks not peaceful days

With this goal in mind we ran robustness checks to see if alternativemodels yield a better explanatory power when we increased the numberof peace days vis-a-vis threat days Because the maximum limit of anunbalanced ratio for automated machine-learning is 13 we attemptedboth 12 and 13 in our tests As Table 2 shows it is clear that the orig-inal model outperforms both alternatives Compared to the 12 ratiothe original model has more explanatory power in every way (iehigher overall accuracy higher threat accuracy higher non-threat accu-racy) By contrast the 13 ratio model has a slightly lower overall accu-racy marginally higher non-threat accuracy but devastatingly lowerthreat accuracy (175) than our original model because the increasingimbalance (from 710 to 13) turned the automated machine-learningprocess to a lsquogreedyrsquo mode As a result it is our conclusion that theoriginal model uses the golden ratio with the most accurate categoriz-ing capacity

20 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Second it is also necessary to check whether an out-of-sample ap-proach is a better alternative to our original model In our analysis wedeveloped a model by using 70 of data from the five North Koreanattacks and then applied it to the remaining 30 of data from Onemay suspect however that a better approach is to elaborate a modelfrom the first three North Korean strikes (ie during the Kim Jong-ilperiod) and then apply it to the remaining two attacks (ie during theKim Jong-un era) in order to see whether it yields more explanatorypower As Table 2 shows the opposite turns out to be the case In factthe out-of-sampling model loses explanatory power in every possibleway (ie lower overall accuracy lower threat accuracy and lower non-threat accuracy) As a result our original model outperforms the out-of-sampling alternative

Third it is necessary to check whether or not there is a leadershipchange effect in North Korea Has the change of power from KimJong-il to Kim Jong-un produced such significant differences in theirleadership style that it may be worth developing two separate models(one for the Kim Jong-il period the other for the Kim Jong-un era)instead of a single model covering the entire period as we did As

Table 2 Comparison with alternative models

ModelsOverall accuracy (N)

Threat accuracy(Sensitivity)

Non-threataccuracy (Specificity)

Rare event I 724 328 823

12 Ratio (368508) (66201) (302367)

Rare event II 788 175 99

13 Ratio (656832) (36206) (620626)

Out of sampling 541 145 821

KJI KJU (6471195) (72495) (575700)

NK leadership KJI 685 (98143) 459 (2861) 793 (6582)

KJI vs KJU KJU 595 (195328) 612 (93152) 58 (102176)

SK leadership Con 614 (221360) 363 (58160) 815 (163200)

Con vs Prog Prog 681 (98144) 493 (3571) 863 (6373)

No cheonan 669 514 787

Case (273408) 92178 181230

Our Model 823 503 961

(401487) (74147) (327340)

Detecting patterns in North Korean military provocations 21

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Table 2 shows the double platform alternative yields a worse outcomeoverall In fact its only advantage is that the Kim-Jong-un-model hasslightly higher threat accuracy (612) than our original model(503) In every other aspect however the Kim-Jong-un-model per-forms much worse Moreover the Kim-Jong-il-model has a much loweraccuracy in all respects than our original model As a result it is clearthat the double platform is a worse alternative to our single platformThe recent leadership change in Pyongyang has shown little effects asfar as its military provocations are concerned

Fourth it is important to investigate whether a leadership change inSouth Korea has any impact on North Korean military provocationsHas the oscillation of power between the lsquoprogressiversquo administrations(Kim Dae-jung and Roh Moo-hyun) and the lsquoconservativersquo regimes(Lee Myung-bak and Park Geun-hye) invoked different responses fromPyongyang If so we should expect a double platform (one model forthe conservative period vs the other model for the progressive era inSouth Korea) to outperform the original single platform As Table 2shows however the two separate models in the double-platform alter-native perform much worse than the original single platform in all cat-egories an overall accuracy threat accuracy and non-threat accuracyClearly the original model is a better choice It is shown above thatthe leadership change in Pyongyang has not produced significant policyshifts in its military provocations Likewise leadership changes inSouth Korea do not seem to have invoked major policy shifts inPyongyang as far as its military provocations are concerned

Finally it is worth examining an alternative model that excludes thesinking of the Cheonan naval ship case Unlike the remaining four at-tacks in our dataset the North Korean government has consistentlydenied its involvement in the Cheonan incident If Pyongyang had notreally sunk the Cheonan as it claimed it means that our original modelis performing below its full potential because it is built upon some er-roneous cases (ie the Cheonan incident) In that case we should ex-pect the no-Cheonan alternative to outperform our original model AsTable 2 shows however when we exclude all the data related to theCheonan case the resulting model loses much explanatory power com-pared to the original model Only in one category (a threat accuracy)the no-Cheonan-case alternative outperforms our original model buteven in that category the difference is ignorable (503 vs 514) In

22 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

all other categories the original model shows much stronger perfor-mances 823 vs 669 in an overall accuracy and 961 vs 787 ina non-threat accuracy Although Pyongyang has denied its involvementin the Cheonan incident the huge performance gap between the origi-nal model and the no-Cheonan alternative suggests otherwise

A different way of comparing performances of alternative models isshown in Figure 6 where the Receiver Operating Characteristic (ROC)is presented A ROC space is defined by a False Positive Rate (1 ndashSpecificity) in the x-axis and a True Positive Rate (Sensitivity) in the y-axis In a ROC space a theoretically perfect model (ie a combinationof 100 True Positive Rate with 0 False Positive Rate) appears in theupper left corner with the coordinates (0 1) In comparison a purelyrandom model such as a coin toss is found along a 45-degree diagonalline As a result good models (ie those performing better than ran-dom guesses) are found above the 45 degree line (especially close to they-axis) whereas bad models are located below it In Figure 6 the

Figure 6 Receiver operating characteristic (ROC) plot

Detecting patterns in North Korean military provocations 23

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

coordinates of our model (004 05) means that it has a False PositiveRate of 4 while its True Positive Rate is 50 By contrast the coordi-nates of alternative models in the ROC space are as follows (i) theRare Event I model with 12 ratio is (018 033) (ii) the Rare Event IImodel with 13 ratio is (001 018) (iii) the Out of Sampling (KJI KJU) model is (018 015) (iv) the Leadership Change (KJI-only)model is (021 046) (v) the Leadership Change (KJU-only) model is(042 061) (vi) the model for Conservative South Korean leadershipis (018 036) (vii) the model for Conservative South Korean leader-ship is (014 049) and (viii) the model that excludes the Cheonan caseis (021 051) As one can see all eight alternatives yield their coordi-nates either close to or lower than the diagonal line in the ROC spacewhereas our model lies above the diagonal line and close to the y-axisin the ROC space As a result we can conclude that the original modelperforms better than its potential rivals

4 Conclusion

In this article we studied articles published by the KCNA the officialnews outlet of North Korea in order to analyze patterns of its conven-tional military provocations To this end we have adopted a newmethod of automated text classification through supervised machinelearning Our model investigated the frequency of all terms appearingin KCNA articles immediately prior to five North Korean military at-tacks between 1997 and 2013 The frequency of these terms was thencompared with the frequency of terms appearing in KCNA articlespublished during peacetime without military provocations The com-parison brought to light a number of key terms ndash lsquoattack wordsrsquo so tospeak ndash whose appearances spiked in the KCNA prior to NorthKorean attacks Based on these terms our model correctly identifieseight of 10 articles as signs of imminent attacks or as peacetime newspieces

Specifically our model found five pattern-detecting terms of NorthKorean military threats lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquo lsquoJunersquo andlsquoJapanesersquo For a proper analysis of their meaning we went back to thearticles in which these five attack words appeared in order to investi-gate the contexts in which they were used by the North Korean gov-ernment Our investigation shows that in the lead-up to an attack

24 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Pyongyang displayed a strong tendency to emphasize the legacy of itsmilitary struggle against lsquoJapanesersquo colonialism and its fight against theUS imperialism during the Korean War (which began in lsquoJunersquo 1950)perhaps in an effort to heighten a domestic patriotic fervor at a timeof impending crisis In addition immediately before Pyongyang em-barks on hostile provocations the KCNA tends to increase its reprint-ing of lsquosignedrsquo commentaries from Rodong Sinmun which typically crit-icizes the United States South Korea and Japan in harsh termsprobably reflecting its deteriorating relations with the outside world

It seems that our methodology is promising to studies of interna-tional conflicts of various sorts First as shown in this article amachine-learning technology is useful in terms of detecting patternsthat distinguish threats of Pyongyang from its lsquonoisesrsquo or lsquobluffingrsquoSecond for other authoritarian regimes where there is a tight govern-ment controls over the mass media our approach can be used to detectpatterns symptoms or clues to escalating crisis Finally even for dem-ocratic countries with a free media our approach can be used on vari-ous occasions For instance a machine-learning technology can be uti-lized to analyze certain aspects of domestic politics such as signalingor communication between policymakers and the public To cite a con-crete example we can use a machine-learning technique to examinehow an aggressive foreign policy like lsquosabre rattlingrsquo affects public per-ceptions of fear or crisis in democracy Also a machine-learning ap-proach can be used to analyze external interactions of a democratic re-gime with other countries For instance we can test if there are anysignificant correlations between presidential elections in the US and thelevel of North Korean threats or between North Korean nuclear testsand varying public reactions in South Korea

As a final note it is not our contention that the model developed inthis article based on an automated machine-learning technique has de-tected intentional signals which Pyongyang sends to the outside worldimmediately prior to its military attack Instead the key attack wordswe have identified should be seen as signs or patterns that the NorthKorean regime unwittingly displays when it is inching toward a militaryoption For two reasons we doubt an intentional or signaling natureof the attack words in our model First if Pyongyang has indeed usedthe five attack words as a deliberate signal to the outside world that itis gearing up for military provocations why would it send the signal in

Detecting patterns in North Korean military provocations 25

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

such an arcane way that can be detected only through a complicatedautomated text classification process If the North Korean regime in-tends to send a signal to the outside world there is a better clearerand more visible option such as an Official Announcement by theMinistry of Foreign Affairs which Pyongyang has occasionally pub-lished in the pages of the KCNA on important issues Second if it hadbeen sending intentional signals of an impending military strike whywould the North Korean regime deny such an attack afterwards If re-peated an ex post denial would only reduce the credibility of an exante signal creating an image of North Korea as a casual bluffer Fortheir arcane nature and occasional ex post denial the five attack wordsin our model should not be treated as a premeditated signal fromPyongyang Instead they should be understood as inadvertent signs orpatterns unconsciously displayed by the North Korean governmentprior to a military attack In fact it is the unplanned nature of theseattack words that provides even more valuable information to the out-side world because it eliminates the possibility of feigned or false sig-nals from North Korea Like a pitcher who unknowingly flinches be-fore he throws a fast ball Pyongyang may unwittingly display certainpatterns before it launches a military strike

Acknowledgements

The publication of this article is supported by the National ResearchFoundation of Korea Grant funded by the Korean Government (NRF-2014S1A5A2A03065042 and NRF-2013S1A3A2055081) Previous ver-sions of this article were presented at the 2014 International StudiesAssociation Annual Convention (Toronto Canada) For questions re-garding the article please contact H Joo

References

Breslow NE (1196) lsquoStatistics in epidemiology the case-control studyrsquoJournal of American Statistical Association 91 433 14ndash28

Cha V (2010) The Sinking of the Cheonan Center for Strategic ampInternational Studies (2010 April 22) httpcsisorgpublicationsinking-cheonan (24 July 2014 date last accessed)

26 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Choe S-H (2009) Korean Navies Skirmish in DisputedWaters New YorkTimes (2009 November 10) httpwwwnytimescom20091111worldasia11koreahtml_rfrac140 (23 June 2014 date last accessed)

Ertekin S (2013) lsquoAdaptive oversampling for imbalanced data classificationrsquoin G Erol amp L Ricard (ed) Information Sciences and Systems pp261ndash269 New York Springer

Global Security (2002) The naval clash on the yellow sea on 29 June 2002 be-tween South and North Korea the situation and ROKrsquos position (2002July 1) Global Security httpwwwglobalsecurityorgwmdlibrarynewsdprk2002dprk-020701-1htm (22 July 2014 date last accessed)

Hong S (2012) lsquoNorth Korearsquos capability to conduct provocations and ROK-US capability to counter themrsquo Korea Association of Defense IndustryStudies 19 2 135ndash136

Hopkins D and King G (2010) lsquoExtracting systemic social science meaningfrom textrsquo American Journal of Political Science 54 1 229ndash47

Jung S-C (2013) Internal Instability and External Provocation Seoul KINU

Jung S-Y (2008) lsquoA study on the origin of the pueblo incident on 1968rsquoKukjejongchiyonrsquogu 11 2 179ndash207

Jurafsky D and Martin J (2009) Speech and Natural Language ProcessingAn Introduction to Natural Language Processing Computational Linguisticsand Speech Recognition NJ Prentice Hall

Kang C-G (2013) lsquoA laboratory for recursive partyrsquo Hangukcontentshakhoe11 4 23ndash32

King G and Zeng L (2001a) lsquoLogistic regression in rare events datarsquoPolitical Analysis 9 2 137ndash163

King G and Zeng L (2001b) lsquoExplaining rare events in InternationalRelationsrsquo International Organization 55 3 693ndash715

Ko M-K (2015) lsquoNorth Korean military adventurism in the late 1960s andthe changes of party-military relationsrsquo Hyondaibukhanyongu 18 3 7ndash58

Lee W-K (2014) lsquoA case study on the provocations by NKrsquo Kunsaji 91 663ndash110

Litwak R (2007) Regime Change Washington DC Johns HopkinsUniversity Press

Macfie N (2013) The Battles of the Korean West Sea (2010 November 29)Reutershttpwwwreuterscomarticle20101129us-korea-north-clashes-idUSTRE6AS1AL20101129 (5 August 2014 date last accessed)

Mearsheimer J J (2001) The Tragedy of Great Power Politics New YorkWW Norton amp Company

Moore M and Hutchison P (2010) Yeonpyeong Island A History(2010November 23) The Telegraph httpwwwtelegraphcouknewsworldnews

Detecting patterns in North Korean military provocations 27

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

asiasouthkorea8155486Yeonpyeong-Island-A-historyhtml (7 August 2014date last accessed)

Oh I-W (2011) lsquoSouth Korearsquos countermeasures against North KorearsquosArmed Provocation and Offensive dialogue proposalrsquo Trsquoongiljyollyak 111 227ndash266

Rich T (2012a) lsquoLike father like son Correlates of leaderhip in NorthKorearsquos english language newsrsquo Korea Observer 43 4 649ndash674

Rich T (2012b) lsquoDeciphering North Korearsquos nuclear rhetoric an automatedcontent analysis of KCNA newsrsquo Asian Affairs An American Review 3973ndash89

Sigal L (1998) Disarming Strangers Nuclear Diplomacy with North KoreaPrinceton Princeton University Press

Sohn J-Y (2002) South North Korea clash at sea CNN (2002 June 29)httpeditioncnncom2002WORLDasiapcfeast0629koreawarships (21July 2014 date last accessed)

Sudworth J (2010) How South Korean Ship was Sunk BBC News (2010May 20) httpwwwbbccouknews10130909 (4 August 2014 date lastaccessed)

28 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

  • lcw016-FN1
  • lcw016-FN2
  • lcw016-FN3
  • lcw016-FN4
Page 7: Detecting patterns in North Korean military provocations ...idse.or.kr/file/Whang_1.pdf · that North Korea may have because of increasing power asymmetry since the collapse of the

(Park 2009) In response the South Korean navy fired with 40 mmand 76 mm machine guns When the battle was over a North Koreantorpedo boat sank a large patrol ship crashed and four patrol shipssustained damage In the process approximately 30 North Korean sol-diers were killed and 70 were wounded As for South Korea four pa-trol killers and one patrol corvette were damaged with nine soldierswounded (Moore and Hutchison 2010)

122 The second battle of Yonpyong 29 June 2002

By 2002 North Korean ships frequently crossed the NLL only to bechased back by South Korean patrol vessels On 29 June 2002 twoNorth Korean patrol ships crossed the border ignoring warnings fromSouth Korean navy speedboats At 1025 am the two North Koreanpatrol boats attacked nearby South Korean vessels with its 85 mmgun 25 mm auxiliary gun and hand carried rockets In response theSouth Korean patrol ships returned fire (Sohn 2002) A 20-minute bat-tle resulted in the death of four South Korean marines one missingand 18 wounded On the North Korean side approximately 30 sailorswere killed or injured While South Korean vessels were partly dam-aged one of the North Korean vessels was towed away in flames(Global Security 2002)

123 The battle of daechong 10 November 2009

On 10 November 2009 a North Korean patrol vessel trespassed theNLL Soon after two 130-ton South Korean vessels issued warningsbut the North Korean vessel ignored them When the South Koreanships fired warning shots the North Korean vessel began firing leadingto a 2-minute battle near Daechong Island located 18 miles off theNorth Korean coast The North Korean patrol vessel also attacked aSouth Korean high-speed patrol vessel (Kim 2009) In response theSouth Korean vessel countered with approximately 200 shots Whenthe battle was over there were no South Korean casualties but NorthKorea suffered one casualty and three injuries with its naval vessellsquowrapped in flamesrsquo (Choe 2009)

Detecting patterns in North Korean military provocations 7

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

124 Sinking of the cheonan naval ship 26 March 2010

On 26 March 2010 the South Korean naval ship Cheonan sank intothe Yellow Sea At 922 pm an explosion occurred in the 1200-tonwarship that was sailing by Baengnyong Island where the two Koreashad clashed numerous times before (Cha 2010) In the end 46 liveswere lost and it was quickly suspected that the ship lsquohad been hit byan external lsquonon-contactrsquo explosionrsquo with North Korea as the primesuspect (Sudworth 2010) Pyongyang denied its involvement andclaimed that the incident had been contrived by South Korea A six-week-long investigation by international experts proved the involve-ment of North Korea in the attack

125 Shelling of Yonpyong island 23 November 2010

On 23 November 2010 North Korea fired artillery shells at YonpyongIsland near the NLL About 200 shells hit the island and set fire todozens of buildings The barrage killed two South Korean citizens andtwo marines while injuring three civilians and 17 soldiers The attackbegan when South Korea was practicing military drills near the NLLdespite North Korean warnings North Korea fired three separate bar-rages with dozens of artillery shells in each barrage In return SouthKorea responded by firing 80 rounds from K9 155 mm self-propelledHowitzers (Kim and Kim 2011) When the battle was over more than50 civilian homes were in flames

13 Research method supervised machine learning

Although the North Korean nuclear crisis has been the focus of the in-ternational community in recent years the history of North Koreanmilitary provocations dates much further back For instancePyongyang destabilized an already precarious security environment inthe Korean Peninsula by launching a series of provocations such asthe hijacking of a South Korean airline (1958) the Korean DMZConflict known as lsquothe Second Korean Warrsquo with more than 700 casu-alties (1969) the hijacking of the USS Pueblo (1968) the notoriousAxe Murder Incident (1976) the bombing of the Korean Air Line 858(1987) and so on

8 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Not surprisingly many scholars have attempted to analyze NorthKorean military provocations trying to identify their lsquocausesrsquo (Jung2013 Lee 2014 Ko 2015) main lsquogoalsrsquo in those attacks (Jung 2008Kang 2013) proper lsquopoliciesrsquo to curtail further threats (Oh 2011Kim 2012) and so on Despite sincere efforts earlier studies sufferedfrom one notorious problem a lack of reliable data As one put itNorth Korea has been lsquothe blackest of black holesrsquo (Litwak 2007289) Under such circumstances scholars had little choice but to makeeducated guesses ndash lsquoguesstimationsrsquo ndash regarding unknown intentionsgoals or likely moves of the North Korean regime As a result the ex-isting literature on North Korean military provocations has beendriven less by reliable data but more by subjective interpretations

By contrast we rely on the official North Korean media KCNA Asfor the method we use a supervised machine-learning technology tomaximize the use of the KCNA dataset that covers various aspects ofNorth Korea from 1997 to 2013 The main advantage of our methodcomes from the quality of the data that is our findings are data-driven and thus more objective than previous works relying on subjec-tive interpretation of selected observations Given the paucity ofreliable information on North Korea it is important to analyze thecontent of official KCNA articles that are publicly available A super-vised machine-learning technology is the optimal method to processsuch data objectively

Supervised machine-learning consists of three steps(i) data collec-tion and document labelling (ii) pre-processing and (iii) model extrac-tion and analysis2 First we obtained our dataset from the KCNAwebsite (httpwwwkcnacojp) Since there are five North Korean mili-tary attacks in our case we pulled 10 tranches of articles from all theKCNA articles published from 1997 to 2013 We then labelled five ofthese tranches lsquothreatrsquo while labelling the other five as lsquonon-threatrsquo All

2 We tried a variety of machine learning algorithms such as Random Forests SupportVector Machines and Conditional Inference Trees Also we cross-validated their resultsOur findings demonstrate that these algorithms produced similar results Moreover all ofthese algorithms identified the similar pattern-detecting terms for classifying the KCNA ar-ticles In this article we reported the results based on the Conditional Inference Tree algo-rithm which ran the tree-structured regression models through binary recursivepartitioning in a conditional inference framework For more details on the ConditionalInference Tree algorithm please see the R Package lsquoPartyrsquo for Conditional Inference Trees(lsquoctreersquo) at httpcranr-projectorgwebpackagespartypartypdf

Detecting patterns in North Korean military provocations 9

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

the articles published within a week preceding each attack is labelled aslsquothreatrsquo and the rest of KCNA articles were treated as lsquonon-threatrsquo Intotal there were 1624 KCNA articles published during this periodwith 487 labelled as lsquothreatrsquo and 1137 labelled as lsquonon-threatrsquoFor the threat articles we extracted all the articles published within aweek of each North Korean military provocation (Fig 2) For the non-threat articles we selected tranches of news articles for a randomlychosen 10-day period from at least two months before or after a NorthKorean attack In so doing our goal was to capture articles that wereunrelated to a North Korean attack thus labelled as lsquonon-threatrsquo

Second the selected dataset is then lsquopreprocessedrsquo Since the originalarticles are not ready for an automated text analysis a cleaning processcalled lsquopreprocessingrsquo is necessary to prepare them for machine-learning Pre-processing is the standard procedure before the applica-tion of an automated text analysis Following the standard preprocess-ing procedure we remove all numbers and stopwords (eg a the into etc)3 We then transformed the body of the text into a documentterm matrix that was composed of rows of news articles followed bycolumns of terms The document term matrix turned preprocessed texts

Non-Threat

-2 +2

Acl

All Threacollected eading up

All Nowere ptwo mo

Threat

Provocation

at-labeledfrom the p to a pro

on-Threatpulled twoonths afte

d articles wseven day

ovocation

t-labeled o months er a provo

ere ays

articles before or

ocation

Non-

r

Threat

Time

Figure 2 Labelling threat and non-threat articles

3 In this article we use one of the most popular preprocessing methods known as the lsquobag ofwordsrsquo concept (Jurafsky and Martin 2009) It discards the use of word order as a factorand removes the so-called lsquostopwordsrsquo from the data The stopwords include punctuationcapitalization common words (eg at an the etc) and numbers For a full list of thestopwords used in our research please visit httpjmlrcsailmitedupapersvolume5lewis04aa11-smart-stop-listenglishstop Although preprocessing reduces the amount of infor-mation it has been shown by researchers that a simplification of text via preprocessing issufficient to infer valuable models (Hopkins and King 2010)

10 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

into quantifiable values (eg term counts and frequencies) for each cor-responding article

Finally once we had collected and preprocessed the KCNA data wesplit the whole data into two subsets a training dataset and a testdataset From the complete set of articles from both threat and non-threat tranches (1624 articles in total) we randomly selected 70(1137 articles) for the training dataset (Fig 3) The labels lsquothreatrsquo orlsquonon-threatrsquo for each article were included in the training dataset forthe purpose of automated machine-learning that is to develop amodel that can select pattern-detecting features based on a priori clas-sifications Basically the key pattern was frequency of occurrence Thefrequency with which certain words and short phrases appeared inthreat or non-threat articles determined how they were selected andweighted as pattern-detecting features in the model The trainedmodel was then applied to the remaining test dataset which was usedsolely for testing the accuracy of our model Importantly we removedall the threat and non-threat labels from each article in the testdataset The purpose for doing so was to challenge our model to usewhat it had learned (from the training dataset) about significant fea-tures (ie a term frequency rate) to accurately classify KCNA articles(in the test dataset) as either a threat or a non-threat without relyingon labels4

All Data 1624 Articles

1137 Articles 487

Training DataThreat Labels Provided

Test DataThreat Labels Removed

Figure 3 Training and test data sets

4 For replication our data is available at httpsdataverseharvardedudatasetxhtmlpersistentIdfrac14doi3A1079102FDVN2FB8CWWD

Detecting patterns in North Korean military provocations 11

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

2 Results

21 Training data results

Figure 4 shows the model we obtained from the training dataset usinga supervised machine-learning The key terms distinguishing NorthKorean threats from a peaceful period were lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquolsquoJunersquo and lsquoJapanesersquo When certain conditions were satisfied (as ex-plained below) the appearance of these terms in KCNA articles couldplay the role of a pattern-indicator that a North Korean military threatwas on the horizon

According to our model the first and strongest indicator of a NorthKorean military provocation is the appearance of the term lsquoyearsrsquo Inthe training dataset all KCNA articles in which the word lsquoyearsrsquo ap-peared at least once (53 articles in total) were published within a weekof a North Korean military strike From our viewpoint the most reli-able sign of an imminent North Korean military provocation is a sud-den increase in the frequency of the term lsquoyearsrsquo in KCNA articles

When the term lsquoyearsrsquo is absent the second best indicator of a NorthKorean military attack is the word lsquosignedrsquo Approximately 85 ofKCNA articles (51 articles in total) that had the term lsquosignedrsquo withoutthe word lsquobakrsquo (from former South Korean president Lee Myung-bak)

Figure 4 First conditional inference tree

12 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

were published within a week of a North Korea military provocationAs a result a sudden spike in the use of the word lsquosignedrsquo (withoutlsquobakrsquo) indicates that a North Korean military provocation is around thecorner In this respect however there is a further condition that mustbe satisfied When lsquosignedrsquo and lsquobakrsquo appear together in the same arti-cle they indicate a pattern of a peaceful situation instead As a resultthe pattern-detecting role of the term lsquosignedrsquo appears to be contingentupon the presence or absence of another term lsquobakrsquo

According to our model the third index indicating that Pyongyangmay be preparing for a military operation is the term lsquoassemblyrsquo Infact all the KCNA articles that included the word lsquoassemblyrsquo withoutlsquoyearsrsquo or lsquosignedrsquo (22 articles in the training dataset) were publishedwithin one week of a North Korean military attack As a result a sud-den increase in the frequency of the word lsquoassemblyrsquo may be a goodpattern indicator that a North Korean military strike is on the horizoneven in the absence of stronger threat terms such as lsquoyearsrsquo andlsquosignedrsquo

The fourth indicator of a North Korean military threat is the termlsquoJunersquo Unlike other indicators however the role of lsquoJunersquo is condi-tional In the absence of other stronger signs (ie years signed and as-sembly) the term lsquoJunersquo can play the role of a significant indicator of aNorth Korean military attack only if another term lsquoreunificationrsquo ap-pears once or less in the same article To our surprise however theterm lsquoJunersquo turns into a strong indicator of a non-threat situation if theword lsquoreunificationrsquo appears more than twice in the same article As aresult it seems that the word lsquoJunersquo implies an increasing NorthKorean threat only if it is not closely associated with another key pat-tern indicator lsquoreunificationrsquo

The last index of a North Korean military provocation is the wordlsquoJapanesersquo When other (and stronger) terms such as lsquoyearsrsquo lsquosignedrsquolsquoassemblyrsquo and lsquoJunersquo were not present all the KCNA articles that in-cluded the term lsquoJapanesersquo (17 articles in the training dataset) appearedwithin one week of a North Korean military provocation As a resultthe use of the term lsquoJapanesersquo serves as a good pattern indication thatin the absence of other attack words Pyongyang is moving dangerouslyclose to a military strike

Detecting patterns in North Korean military provocations 13

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

22 Testing the model

There are 904 KCNA articles that do not include any of the indicatorsdiscussed above Because they do not contain any pattern indicator ofa North Korean attack our model identifies them as non-threats Inreality however about 20 of these articles were published within oneweek of the five military provocations by Pyongyang As a result ourmodel has roughly 80 accuracy in identifying North Korean militarythreats What is encouraging is that our model has identified five keypattern indicators of increasing North Korean threats When these keyterms are put together under certain conditions they correctly identify80 of articles in the training dataset as threat articles or non-threatitems Put differently the model can accurately classify in 8 of 10 caseswhether various messages from Pyongyang are real threats or justrhetoric

Although 80 accuracy is impressive it is too early to be optimisticAfter all 80 accuracy was achieved with the training dataset fromwhich our model was originally derived If we were impressed by its80 accuracy rate we would resemble a case study specialist who de-veloped an elaborate theory from a few cases and then applied it backto those original cases only to be impressed by how accurate hishertheory was The real test consists of applying our model to cases otherthan those from which it is derived This is the reason why we dividedthe entire KCNA data into two subsets the training dataset fromwhich our model is developed and the test dataset to which it will beapplied as a real test

Table 1 shows the result of our model against the test datasetNumbers in the cells indicate whether there was agreement or discrep-ancy between model predictions and actual cases For example our

Table 1 Model accuracy against test dataset

ActualModel Threat Non-threat

Threat 74 13 Positive predictive value 085

Non-threat 73 327 Negative predictive value 082

Sensitivity Specificity Overall accuracy

050 096 082

14 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

model classified 327 cases (the lower right cell) as non-threat and thosearticles were in fact published during non-threat weeks In additionthe model classified 74 articles (the upper left cell) as threats and thosearticles were actually published during threat weeks When we applyour model to the test dataset its overall accuracy turns out 823roughly the same level we obtained with the training dataset Althoughour model is deduced from the training dataset it does not lose cate-gorizing capacity when applied to the test dataset Specifically of 487articles in the test dataset our model correctly classified 401 articles(823) as either threats or non-threats The model is very effective inidentifying harmless rhetoric from Pyongyang that is messages not re-sulting in actual military attacks When applied to a total of 340 non-threat articles in the test dataset our model successfully classified 327items as noise with only 13 misses (ie 967 accuracy) As a result itis very effective at filtering noise from Pyongyang By contrast thepattern-detecting accuracy of our model drops somewhat with respectto threats Of 147 threat articles in the test dataset it correctly identi-fies 74 items as threats while misrecognizing 73 threats as peaceful arti-cles (ie 50 accuracy)

3 Discussion

What does the model tell us in plain terms In particular what are themeanings of the key indicators for a North Korean military threat (ieyears signed assembly June and Japanese) To answer it is necessaryto go back to the original KCNA documents in order to understandthe contexts in which these terms were used A close reading of theKCNA articles that included the five key indicators or lsquoattack wordsrsquoreveals several interesting patterns

First the North Korean regime tends to emphasize its history ofmilitary struggle against foreign enemies before it launches armed prov-ocations It is then understandable that key terms such as lsquoyearsrsquo andlsquoJapanesersquo often appear in KCNA articles immediately before a militarystrike Prior to the second naval clash of Yonpyong on 29 June 2002for instance the North Korean government published a series of arti-cles in which it boasted of the lsquoyearsrsquo of North Korearsquos victoriousstruggles against foreign enemies including lsquoJapanesersquo colonialism(1910ndash45) It is possible that Pyongyang invoked the military legacy of

Detecting patterns in North Korean military provocations 15

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

its supreme leader Kim Il-sung in anticipation of an impending con-flict either to boost the morale of its people or to show the outsideworld that it was determined not to back down

Second the most perplexing term in Figure 4 is lsquoJunersquo because itcan be indicative of both threat and peace Specifically the word lsquoJunersquois an indicator of peace when associated with lsquoreunificationrsquo but it canalso be a sign of a military threat when it does not appear in conjunc-tion with lsquoreunificationrsquo A careful reading of the KCNA articles thatinclude the term lsquoJunersquo suggests an explanation for this strange phe-nomenon It turns out that the word lsquoJunersquo can refer to either theKorean War (which broke out on 25 June 1950) or the historic summitbetween Kim Dae-jung and Kim Il-sung on 15 June 2000 (which waspraised as a major cornerstone for future lsquoreunificationrsquo in the officialNorth Korean press) As a result when the word lsquoJunersquo appears inclose association with lsquoreunificationrsquo it refers to the 2000 summit (orlsquothe June 15 Summitrsquo as it is called in North Korea) thus indicatingthe peaceful intentions of Pyongyang By contrast when the wordlsquoJunersquo appears alone without connection to lsquoreunificationrsquo it refers tothe Korean War thus conveying a more hostile mood

Third it is also interesting to note that the North Korean govern-ment tends to publish many lsquosignedrsquo commentaries from RodongSinmun in the KCNA immediately before it launches military provoca-tions In these lsquosignedrsquo commentaries of Rodong Sinmun Pyongyangusually criticizes foreign countries (especially the United States Japanand South Korea) in very harsh terms on various issues Accordinglythe third term lsquosignedrsquo seems to correspond to the fact that RodongSinmun the official mouthpiece of the North Korean regime barksvery loudly before it actually bites its intended target A sudden in-crease in the number of lsquosignedrsquo commentaries of Rodong Sinmun inKCNA articles appears to be a reliable sign that the North Koreangovernment may be turning to an attack mode

Finally the last indicator of an imminent threat lsquoassemblyrsquo is theeasiest term to identify but the hardest one to interpret As shown inFigure 5 the term lsquoassemblyrsquo is primarily used in reference to the SPABecause the SPA is the highest North Korean authority with regard toits relations with foreign countries it is tempting to interpret a suddenincrease in references to the SPA prior to military provocations as indi-cating that Pyongyang is attempting to openly signal its hostile

16 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

intentions to the outside world In other words belligerent messagesemanating from the SPA (and reported in KCNA articles) may revealthe determination of the North Korean regime

Although plausible the problem with such an interpretation is thata close reading of several KCNA articles using the term lsquoassemblyrsquoshows that messages from the SPA are often not dire at all For exam-ple consider the following article published on 17 November 2010 justdays before the North Korean shelling at Yonpyong Island lsquoKim YongNam President of the Presidium of the DPRK SPA sent a message ofgreetings to Qaboos Bin Said Sultan of Oman on Wednesday on theoccasion of its national holidayrsquo (KCNA 17 November 2010) As thisexample illustrates a typical KCNA article with the word lsquoassemblyrsquodescribes rather routine business of the SPA such as how it sent a mes-sage of congratulations to a foreign country how it greeted visitingdignitaries from foreign countries and so on Clearly such mundanemessages cannot be a meaningful sign of an imminent North Koreanmilitary provocation While the publication of these articles (containingthe term lsquoassemblyrsquo) might possibly correspond to an uptick inPyongyangrsquos diplomatic efforts to strengthen its ties with foreign coun-tries and increase international support prior to a military attack weneed further analyses to substantiate such a hypothesis At the sametime it also seems clear from the test that the term lsquoassemblyrsquo does not

Figure 5 Social network analysis for key words in KCNA articles

Detecting patterns in North Korean military provocations 17

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

appear randomly and is somehow linked to the timing of NorthKorean military provocations Further research is necessary to make aproper interpretation of this seemingly irrelevant yet apparently signifi-cant indicator of North Korean military threats

31 Robustness checks

Before we conclude it is necessary to address five issues regarding therobustness of our findings (i) the rare-event issue (that is due to therare nature of a North Korean military attack our sampling ratio of710 should be justified) (ii) an out-of-sample alternative (that is in-stead of building our model by using 70 of data from the five NorthKorean military strikes and then applying it to the remaining 30 ofthe data it may be better to develop a model from the first threestrikes and then apply it to the remaining two North Korean provoca-tions) and (iii) a North Korean leadership change effect (that is in-stead of elaborating a single model covering 1997 to 2013 it may makesense to develop two separate models [one for the Kim Jong-il periodand the other for the Kim Jong-un era] in order to check whether thereis a leadership change effect) (iv) a South Korean leadership changeeffect (that is it may be better to elaborate two different models [onefor the lsquoconservativersquo period during Lee Myung-bak and Park Geun-hye and the other for the lsquoprogressiversquo era during Kim Dae-jung andRoh Moo-hyun] in order to investigate whether North Korea re-sponded differently to leadership changes in South Korea and (v) ano-Cheonan alternative (that is it is worth elaborating a model whileexcluding the Cheonan naval ship case because unlike the remainingfour attacks Pyongyang has denied its involvement in sinking theCheonan In this section these five issues are addressed to check therobustness of our model

First there is a discrepancy in the ratio of threat days vs non-threatdays between the data used in our analysis and the actual frequencyAs explained earlier we have defined the threat articles as those pub-lished one week prior to each actual crisis whereas the non-threatitems are defined as those chosen randomly for a 10-day period at leasttwo months before or after actual provocations For all five crises theratio of our data is then 3550 (in days) or 710 In reality however thenumber of peace days (ie 18 years minus 35 days) is much larger than

18 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

the number of crisis days (35 days) When we count all of the peacedays that are not included in our data (ie when we take all TrueNegatives into our data) the North Korean provocations become ex-tremely rare events As a result a possible criticism is that our ap-proach does not represent the true data generating process in a sam-pling procedure While military provocations occur rarely in realityour sampling procedure exaggerates their likelihood by significantly re-ducing the number of peace days in the data Put more technically ourmodel reduces the number of False Positives while increasing the num-ber of False Negatives

Despite the criticism our sampling strategy can be justified for tworeasons First the rare-event problem is not new in political scienceFor example King and Zeng (2001ab) addressed the same problem instatistical analysis (eg logit analysis with rare events) A commonlysuggested solution is the lsquochoice-based or endogenous stratified sam-plingrsquo in econometrics and the lsquocase-control designrsquo in epidemiology(Breslow 1996) The idea is to lsquoselect within categories of Yrsquo such thatall crisis days are sampled while we use a small random fraction ofnon-crisis days In this respect our sampling strategy is consistent withsuch an approach in that we have chosen all crisis days (35) and a ran-dom fraction of peace days (50) Not surprisingly such an approach iscommonly used in supervised machine-learning as well The rare-eventissue also known as a class imbalance problem is an ongoing issue insupervised machine-learning because the size of one class is often muchlarger than that of the other class (eg premature births violent civilconflicts fraudulent credit card transactions etc) In supervisedmachine-learning two solutions have been suggested a data samplingtechnique and an algorithmic modification technique Whereas datasampling is to under-sample the majority class while over-sampling theminority class a modification technique is to combine both over- andunder-sampling methods at the same time In this article we haveadopted a data sampling technique in order to address the class imbal-ance between threat days (a minority class) and non-threat days (a ma-jority class) in North Korean military provocations

Second although reducing the number of non-crisis days in ourdata (under-sampling) while using all the crisis days (over-sampling) isconsistent with both statistical and machine-learning literature therestill remains a question How far should we reduce the number of

Detecting patterns in North Korean military provocations 19

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

peace days Will a ratio of 12 13 or an even more skewed ratio gen-erate a better performance than our model using a 710 ratio To an-swer this question we pushed the ratio until we saw a clear sign of themodel becoming too skewed towards the majority class (peace days)In our case it occurred when the ratio between the two classes ex-ceeded 13 On this subject literature on machine-learning clearlyshows that the ratio of crisis and non-crisis should not be significantlyunbalanced According to Ertekin (2013) lsquo[i]n problems where theprevalence of classes is imbalanced it is necessary to prevent the resul-tant model from being skewed towards the majority (negative) classand to ensure that the model is capable of reflecting the true nature ofthe minority (positive) classrsquo Otherwise the generated model can sufferfrom an over-fitting problem For instance in the face of an extremelyskewed dataset (eg the real ratio of our data would be 35 crisis daysvs 6535 peace days) an automated machine-learning process naturallybecomes lsquogreedyrsquo that is in order to increase its overall accuracy it in-creasingly focuses on the larger category (6535 days) while virtually ig-noring the smaller category (35 days) Generating the model in thisway would be problematic however because our object is to focus onthe minority category of crisis days which is less than 05 of all daysfrom 1997 to 2013 After all we are trying to classify upcoming NorthKorean attacks not peaceful days

With this goal in mind we ran robustness checks to see if alternativemodels yield a better explanatory power when we increased the numberof peace days vis-a-vis threat days Because the maximum limit of anunbalanced ratio for automated machine-learning is 13 we attemptedboth 12 and 13 in our tests As Table 2 shows it is clear that the orig-inal model outperforms both alternatives Compared to the 12 ratiothe original model has more explanatory power in every way (iehigher overall accuracy higher threat accuracy higher non-threat accu-racy) By contrast the 13 ratio model has a slightly lower overall accu-racy marginally higher non-threat accuracy but devastatingly lowerthreat accuracy (175) than our original model because the increasingimbalance (from 710 to 13) turned the automated machine-learningprocess to a lsquogreedyrsquo mode As a result it is our conclusion that theoriginal model uses the golden ratio with the most accurate categoriz-ing capacity

20 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Second it is also necessary to check whether an out-of-sample ap-proach is a better alternative to our original model In our analysis wedeveloped a model by using 70 of data from the five North Koreanattacks and then applied it to the remaining 30 of data from Onemay suspect however that a better approach is to elaborate a modelfrom the first three North Korean strikes (ie during the Kim Jong-ilperiod) and then apply it to the remaining two attacks (ie during theKim Jong-un era) in order to see whether it yields more explanatorypower As Table 2 shows the opposite turns out to be the case In factthe out-of-sampling model loses explanatory power in every possibleway (ie lower overall accuracy lower threat accuracy and lower non-threat accuracy) As a result our original model outperforms the out-of-sampling alternative

Third it is necessary to check whether or not there is a leadershipchange effect in North Korea Has the change of power from KimJong-il to Kim Jong-un produced such significant differences in theirleadership style that it may be worth developing two separate models(one for the Kim Jong-il period the other for the Kim Jong-un era)instead of a single model covering the entire period as we did As

Table 2 Comparison with alternative models

ModelsOverall accuracy (N)

Threat accuracy(Sensitivity)

Non-threataccuracy (Specificity)

Rare event I 724 328 823

12 Ratio (368508) (66201) (302367)

Rare event II 788 175 99

13 Ratio (656832) (36206) (620626)

Out of sampling 541 145 821

KJI KJU (6471195) (72495) (575700)

NK leadership KJI 685 (98143) 459 (2861) 793 (6582)

KJI vs KJU KJU 595 (195328) 612 (93152) 58 (102176)

SK leadership Con 614 (221360) 363 (58160) 815 (163200)

Con vs Prog Prog 681 (98144) 493 (3571) 863 (6373)

No cheonan 669 514 787

Case (273408) 92178 181230

Our Model 823 503 961

(401487) (74147) (327340)

Detecting patterns in North Korean military provocations 21

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Table 2 shows the double platform alternative yields a worse outcomeoverall In fact its only advantage is that the Kim-Jong-un-model hasslightly higher threat accuracy (612) than our original model(503) In every other aspect however the Kim-Jong-un-model per-forms much worse Moreover the Kim-Jong-il-model has a much loweraccuracy in all respects than our original model As a result it is clearthat the double platform is a worse alternative to our single platformThe recent leadership change in Pyongyang has shown little effects asfar as its military provocations are concerned

Fourth it is important to investigate whether a leadership change inSouth Korea has any impact on North Korean military provocationsHas the oscillation of power between the lsquoprogressiversquo administrations(Kim Dae-jung and Roh Moo-hyun) and the lsquoconservativersquo regimes(Lee Myung-bak and Park Geun-hye) invoked different responses fromPyongyang If so we should expect a double platform (one model forthe conservative period vs the other model for the progressive era inSouth Korea) to outperform the original single platform As Table 2shows however the two separate models in the double-platform alter-native perform much worse than the original single platform in all cat-egories an overall accuracy threat accuracy and non-threat accuracyClearly the original model is a better choice It is shown above thatthe leadership change in Pyongyang has not produced significant policyshifts in its military provocations Likewise leadership changes inSouth Korea do not seem to have invoked major policy shifts inPyongyang as far as its military provocations are concerned

Finally it is worth examining an alternative model that excludes thesinking of the Cheonan naval ship case Unlike the remaining four at-tacks in our dataset the North Korean government has consistentlydenied its involvement in the Cheonan incident If Pyongyang had notreally sunk the Cheonan as it claimed it means that our original modelis performing below its full potential because it is built upon some er-roneous cases (ie the Cheonan incident) In that case we should ex-pect the no-Cheonan alternative to outperform our original model AsTable 2 shows however when we exclude all the data related to theCheonan case the resulting model loses much explanatory power com-pared to the original model Only in one category (a threat accuracy)the no-Cheonan-case alternative outperforms our original model buteven in that category the difference is ignorable (503 vs 514) In

22 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

all other categories the original model shows much stronger perfor-mances 823 vs 669 in an overall accuracy and 961 vs 787 ina non-threat accuracy Although Pyongyang has denied its involvementin the Cheonan incident the huge performance gap between the origi-nal model and the no-Cheonan alternative suggests otherwise

A different way of comparing performances of alternative models isshown in Figure 6 where the Receiver Operating Characteristic (ROC)is presented A ROC space is defined by a False Positive Rate (1 ndashSpecificity) in the x-axis and a True Positive Rate (Sensitivity) in the y-axis In a ROC space a theoretically perfect model (ie a combinationof 100 True Positive Rate with 0 False Positive Rate) appears in theupper left corner with the coordinates (0 1) In comparison a purelyrandom model such as a coin toss is found along a 45-degree diagonalline As a result good models (ie those performing better than ran-dom guesses) are found above the 45 degree line (especially close to they-axis) whereas bad models are located below it In Figure 6 the

Figure 6 Receiver operating characteristic (ROC) plot

Detecting patterns in North Korean military provocations 23

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

coordinates of our model (004 05) means that it has a False PositiveRate of 4 while its True Positive Rate is 50 By contrast the coordi-nates of alternative models in the ROC space are as follows (i) theRare Event I model with 12 ratio is (018 033) (ii) the Rare Event IImodel with 13 ratio is (001 018) (iii) the Out of Sampling (KJI KJU) model is (018 015) (iv) the Leadership Change (KJI-only)model is (021 046) (v) the Leadership Change (KJU-only) model is(042 061) (vi) the model for Conservative South Korean leadershipis (018 036) (vii) the model for Conservative South Korean leader-ship is (014 049) and (viii) the model that excludes the Cheonan caseis (021 051) As one can see all eight alternatives yield their coordi-nates either close to or lower than the diagonal line in the ROC spacewhereas our model lies above the diagonal line and close to the y-axisin the ROC space As a result we can conclude that the original modelperforms better than its potential rivals

4 Conclusion

In this article we studied articles published by the KCNA the officialnews outlet of North Korea in order to analyze patterns of its conven-tional military provocations To this end we have adopted a newmethod of automated text classification through supervised machinelearning Our model investigated the frequency of all terms appearingin KCNA articles immediately prior to five North Korean military at-tacks between 1997 and 2013 The frequency of these terms was thencompared with the frequency of terms appearing in KCNA articlespublished during peacetime without military provocations The com-parison brought to light a number of key terms ndash lsquoattack wordsrsquo so tospeak ndash whose appearances spiked in the KCNA prior to NorthKorean attacks Based on these terms our model correctly identifieseight of 10 articles as signs of imminent attacks or as peacetime newspieces

Specifically our model found five pattern-detecting terms of NorthKorean military threats lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquo lsquoJunersquo andlsquoJapanesersquo For a proper analysis of their meaning we went back to thearticles in which these five attack words appeared in order to investi-gate the contexts in which they were used by the North Korean gov-ernment Our investigation shows that in the lead-up to an attack

24 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Pyongyang displayed a strong tendency to emphasize the legacy of itsmilitary struggle against lsquoJapanesersquo colonialism and its fight against theUS imperialism during the Korean War (which began in lsquoJunersquo 1950)perhaps in an effort to heighten a domestic patriotic fervor at a timeof impending crisis In addition immediately before Pyongyang em-barks on hostile provocations the KCNA tends to increase its reprint-ing of lsquosignedrsquo commentaries from Rodong Sinmun which typically crit-icizes the United States South Korea and Japan in harsh termsprobably reflecting its deteriorating relations with the outside world

It seems that our methodology is promising to studies of interna-tional conflicts of various sorts First as shown in this article amachine-learning technology is useful in terms of detecting patternsthat distinguish threats of Pyongyang from its lsquonoisesrsquo or lsquobluffingrsquoSecond for other authoritarian regimes where there is a tight govern-ment controls over the mass media our approach can be used to detectpatterns symptoms or clues to escalating crisis Finally even for dem-ocratic countries with a free media our approach can be used on vari-ous occasions For instance a machine-learning technology can be uti-lized to analyze certain aspects of domestic politics such as signalingor communication between policymakers and the public To cite a con-crete example we can use a machine-learning technique to examinehow an aggressive foreign policy like lsquosabre rattlingrsquo affects public per-ceptions of fear or crisis in democracy Also a machine-learning ap-proach can be used to analyze external interactions of a democratic re-gime with other countries For instance we can test if there are anysignificant correlations between presidential elections in the US and thelevel of North Korean threats or between North Korean nuclear testsand varying public reactions in South Korea

As a final note it is not our contention that the model developed inthis article based on an automated machine-learning technique has de-tected intentional signals which Pyongyang sends to the outside worldimmediately prior to its military attack Instead the key attack wordswe have identified should be seen as signs or patterns that the NorthKorean regime unwittingly displays when it is inching toward a militaryoption For two reasons we doubt an intentional or signaling natureof the attack words in our model First if Pyongyang has indeed usedthe five attack words as a deliberate signal to the outside world that itis gearing up for military provocations why would it send the signal in

Detecting patterns in North Korean military provocations 25

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

such an arcane way that can be detected only through a complicatedautomated text classification process If the North Korean regime in-tends to send a signal to the outside world there is a better clearerand more visible option such as an Official Announcement by theMinistry of Foreign Affairs which Pyongyang has occasionally pub-lished in the pages of the KCNA on important issues Second if it hadbeen sending intentional signals of an impending military strike whywould the North Korean regime deny such an attack afterwards If re-peated an ex post denial would only reduce the credibility of an exante signal creating an image of North Korea as a casual bluffer Fortheir arcane nature and occasional ex post denial the five attack wordsin our model should not be treated as a premeditated signal fromPyongyang Instead they should be understood as inadvertent signs orpatterns unconsciously displayed by the North Korean governmentprior to a military attack In fact it is the unplanned nature of theseattack words that provides even more valuable information to the out-side world because it eliminates the possibility of feigned or false sig-nals from North Korea Like a pitcher who unknowingly flinches be-fore he throws a fast ball Pyongyang may unwittingly display certainpatterns before it launches a military strike

Acknowledgements

The publication of this article is supported by the National ResearchFoundation of Korea Grant funded by the Korean Government (NRF-2014S1A5A2A03065042 and NRF-2013S1A3A2055081) Previous ver-sions of this article were presented at the 2014 International StudiesAssociation Annual Convention (Toronto Canada) For questions re-garding the article please contact H Joo

References

Breslow NE (1196) lsquoStatistics in epidemiology the case-control studyrsquoJournal of American Statistical Association 91 433 14ndash28

Cha V (2010) The Sinking of the Cheonan Center for Strategic ampInternational Studies (2010 April 22) httpcsisorgpublicationsinking-cheonan (24 July 2014 date last accessed)

26 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Choe S-H (2009) Korean Navies Skirmish in DisputedWaters New YorkTimes (2009 November 10) httpwwwnytimescom20091111worldasia11koreahtml_rfrac140 (23 June 2014 date last accessed)

Ertekin S (2013) lsquoAdaptive oversampling for imbalanced data classificationrsquoin G Erol amp L Ricard (ed) Information Sciences and Systems pp261ndash269 New York Springer

Global Security (2002) The naval clash on the yellow sea on 29 June 2002 be-tween South and North Korea the situation and ROKrsquos position (2002July 1) Global Security httpwwwglobalsecurityorgwmdlibrarynewsdprk2002dprk-020701-1htm (22 July 2014 date last accessed)

Hong S (2012) lsquoNorth Korearsquos capability to conduct provocations and ROK-US capability to counter themrsquo Korea Association of Defense IndustryStudies 19 2 135ndash136

Hopkins D and King G (2010) lsquoExtracting systemic social science meaningfrom textrsquo American Journal of Political Science 54 1 229ndash47

Jung S-C (2013) Internal Instability and External Provocation Seoul KINU

Jung S-Y (2008) lsquoA study on the origin of the pueblo incident on 1968rsquoKukjejongchiyonrsquogu 11 2 179ndash207

Jurafsky D and Martin J (2009) Speech and Natural Language ProcessingAn Introduction to Natural Language Processing Computational Linguisticsand Speech Recognition NJ Prentice Hall

Kang C-G (2013) lsquoA laboratory for recursive partyrsquo Hangukcontentshakhoe11 4 23ndash32

King G and Zeng L (2001a) lsquoLogistic regression in rare events datarsquoPolitical Analysis 9 2 137ndash163

King G and Zeng L (2001b) lsquoExplaining rare events in InternationalRelationsrsquo International Organization 55 3 693ndash715

Ko M-K (2015) lsquoNorth Korean military adventurism in the late 1960s andthe changes of party-military relationsrsquo Hyondaibukhanyongu 18 3 7ndash58

Lee W-K (2014) lsquoA case study on the provocations by NKrsquo Kunsaji 91 663ndash110

Litwak R (2007) Regime Change Washington DC Johns HopkinsUniversity Press

Macfie N (2013) The Battles of the Korean West Sea (2010 November 29)Reutershttpwwwreuterscomarticle20101129us-korea-north-clashes-idUSTRE6AS1AL20101129 (5 August 2014 date last accessed)

Mearsheimer J J (2001) The Tragedy of Great Power Politics New YorkWW Norton amp Company

Moore M and Hutchison P (2010) Yeonpyeong Island A History(2010November 23) The Telegraph httpwwwtelegraphcouknewsworldnews

Detecting patterns in North Korean military provocations 27

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

asiasouthkorea8155486Yeonpyeong-Island-A-historyhtml (7 August 2014date last accessed)

Oh I-W (2011) lsquoSouth Korearsquos countermeasures against North KorearsquosArmed Provocation and Offensive dialogue proposalrsquo Trsquoongiljyollyak 111 227ndash266

Rich T (2012a) lsquoLike father like son Correlates of leaderhip in NorthKorearsquos english language newsrsquo Korea Observer 43 4 649ndash674

Rich T (2012b) lsquoDeciphering North Korearsquos nuclear rhetoric an automatedcontent analysis of KCNA newsrsquo Asian Affairs An American Review 3973ndash89

Sigal L (1998) Disarming Strangers Nuclear Diplomacy with North KoreaPrinceton Princeton University Press

Sohn J-Y (2002) South North Korea clash at sea CNN (2002 June 29)httpeditioncnncom2002WORLDasiapcfeast0629koreawarships (21July 2014 date last accessed)

Sudworth J (2010) How South Korean Ship was Sunk BBC News (2010May 20) httpwwwbbccouknews10130909 (4 August 2014 date lastaccessed)

28 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

  • lcw016-FN1
  • lcw016-FN2
  • lcw016-FN3
  • lcw016-FN4
Page 8: Detecting patterns in North Korean military provocations ...idse.or.kr/file/Whang_1.pdf · that North Korea may have because of increasing power asymmetry since the collapse of the

124 Sinking of the cheonan naval ship 26 March 2010

On 26 March 2010 the South Korean naval ship Cheonan sank intothe Yellow Sea At 922 pm an explosion occurred in the 1200-tonwarship that was sailing by Baengnyong Island where the two Koreashad clashed numerous times before (Cha 2010) In the end 46 liveswere lost and it was quickly suspected that the ship lsquohad been hit byan external lsquonon-contactrsquo explosionrsquo with North Korea as the primesuspect (Sudworth 2010) Pyongyang denied its involvement andclaimed that the incident had been contrived by South Korea A six-week-long investigation by international experts proved the involve-ment of North Korea in the attack

125 Shelling of Yonpyong island 23 November 2010

On 23 November 2010 North Korea fired artillery shells at YonpyongIsland near the NLL About 200 shells hit the island and set fire todozens of buildings The barrage killed two South Korean citizens andtwo marines while injuring three civilians and 17 soldiers The attackbegan when South Korea was practicing military drills near the NLLdespite North Korean warnings North Korea fired three separate bar-rages with dozens of artillery shells in each barrage In return SouthKorea responded by firing 80 rounds from K9 155 mm self-propelledHowitzers (Kim and Kim 2011) When the battle was over more than50 civilian homes were in flames

13 Research method supervised machine learning

Although the North Korean nuclear crisis has been the focus of the in-ternational community in recent years the history of North Koreanmilitary provocations dates much further back For instancePyongyang destabilized an already precarious security environment inthe Korean Peninsula by launching a series of provocations such asthe hijacking of a South Korean airline (1958) the Korean DMZConflict known as lsquothe Second Korean Warrsquo with more than 700 casu-alties (1969) the hijacking of the USS Pueblo (1968) the notoriousAxe Murder Incident (1976) the bombing of the Korean Air Line 858(1987) and so on

8 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Not surprisingly many scholars have attempted to analyze NorthKorean military provocations trying to identify their lsquocausesrsquo (Jung2013 Lee 2014 Ko 2015) main lsquogoalsrsquo in those attacks (Jung 2008Kang 2013) proper lsquopoliciesrsquo to curtail further threats (Oh 2011Kim 2012) and so on Despite sincere efforts earlier studies sufferedfrom one notorious problem a lack of reliable data As one put itNorth Korea has been lsquothe blackest of black holesrsquo (Litwak 2007289) Under such circumstances scholars had little choice but to makeeducated guesses ndash lsquoguesstimationsrsquo ndash regarding unknown intentionsgoals or likely moves of the North Korean regime As a result the ex-isting literature on North Korean military provocations has beendriven less by reliable data but more by subjective interpretations

By contrast we rely on the official North Korean media KCNA Asfor the method we use a supervised machine-learning technology tomaximize the use of the KCNA dataset that covers various aspects ofNorth Korea from 1997 to 2013 The main advantage of our methodcomes from the quality of the data that is our findings are data-driven and thus more objective than previous works relying on subjec-tive interpretation of selected observations Given the paucity ofreliable information on North Korea it is important to analyze thecontent of official KCNA articles that are publicly available A super-vised machine-learning technology is the optimal method to processsuch data objectively

Supervised machine-learning consists of three steps(i) data collec-tion and document labelling (ii) pre-processing and (iii) model extrac-tion and analysis2 First we obtained our dataset from the KCNAwebsite (httpwwwkcnacojp) Since there are five North Korean mili-tary attacks in our case we pulled 10 tranches of articles from all theKCNA articles published from 1997 to 2013 We then labelled five ofthese tranches lsquothreatrsquo while labelling the other five as lsquonon-threatrsquo All

2 We tried a variety of machine learning algorithms such as Random Forests SupportVector Machines and Conditional Inference Trees Also we cross-validated their resultsOur findings demonstrate that these algorithms produced similar results Moreover all ofthese algorithms identified the similar pattern-detecting terms for classifying the KCNA ar-ticles In this article we reported the results based on the Conditional Inference Tree algo-rithm which ran the tree-structured regression models through binary recursivepartitioning in a conditional inference framework For more details on the ConditionalInference Tree algorithm please see the R Package lsquoPartyrsquo for Conditional Inference Trees(lsquoctreersquo) at httpcranr-projectorgwebpackagespartypartypdf

Detecting patterns in North Korean military provocations 9

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

the articles published within a week preceding each attack is labelled aslsquothreatrsquo and the rest of KCNA articles were treated as lsquonon-threatrsquo Intotal there were 1624 KCNA articles published during this periodwith 487 labelled as lsquothreatrsquo and 1137 labelled as lsquonon-threatrsquoFor the threat articles we extracted all the articles published within aweek of each North Korean military provocation (Fig 2) For the non-threat articles we selected tranches of news articles for a randomlychosen 10-day period from at least two months before or after a NorthKorean attack In so doing our goal was to capture articles that wereunrelated to a North Korean attack thus labelled as lsquonon-threatrsquo

Second the selected dataset is then lsquopreprocessedrsquo Since the originalarticles are not ready for an automated text analysis a cleaning processcalled lsquopreprocessingrsquo is necessary to prepare them for machine-learning Pre-processing is the standard procedure before the applica-tion of an automated text analysis Following the standard preprocess-ing procedure we remove all numbers and stopwords (eg a the into etc)3 We then transformed the body of the text into a documentterm matrix that was composed of rows of news articles followed bycolumns of terms The document term matrix turned preprocessed texts

Non-Threat

-2 +2

Acl

All Threacollected eading up

All Nowere ptwo mo

Threat

Provocation

at-labeledfrom the p to a pro

on-Threatpulled twoonths afte

d articles wseven day

ovocation

t-labeled o months er a provo

ere ays

articles before or

ocation

Non-

r

Threat

Time

Figure 2 Labelling threat and non-threat articles

3 In this article we use one of the most popular preprocessing methods known as the lsquobag ofwordsrsquo concept (Jurafsky and Martin 2009) It discards the use of word order as a factorand removes the so-called lsquostopwordsrsquo from the data The stopwords include punctuationcapitalization common words (eg at an the etc) and numbers For a full list of thestopwords used in our research please visit httpjmlrcsailmitedupapersvolume5lewis04aa11-smart-stop-listenglishstop Although preprocessing reduces the amount of infor-mation it has been shown by researchers that a simplification of text via preprocessing issufficient to infer valuable models (Hopkins and King 2010)

10 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

into quantifiable values (eg term counts and frequencies) for each cor-responding article

Finally once we had collected and preprocessed the KCNA data wesplit the whole data into two subsets a training dataset and a testdataset From the complete set of articles from both threat and non-threat tranches (1624 articles in total) we randomly selected 70(1137 articles) for the training dataset (Fig 3) The labels lsquothreatrsquo orlsquonon-threatrsquo for each article were included in the training dataset forthe purpose of automated machine-learning that is to develop amodel that can select pattern-detecting features based on a priori clas-sifications Basically the key pattern was frequency of occurrence Thefrequency with which certain words and short phrases appeared inthreat or non-threat articles determined how they were selected andweighted as pattern-detecting features in the model The trainedmodel was then applied to the remaining test dataset which was usedsolely for testing the accuracy of our model Importantly we removedall the threat and non-threat labels from each article in the testdataset The purpose for doing so was to challenge our model to usewhat it had learned (from the training dataset) about significant fea-tures (ie a term frequency rate) to accurately classify KCNA articles(in the test dataset) as either a threat or a non-threat without relyingon labels4

All Data 1624 Articles

1137 Articles 487

Training DataThreat Labels Provided

Test DataThreat Labels Removed

Figure 3 Training and test data sets

4 For replication our data is available at httpsdataverseharvardedudatasetxhtmlpersistentIdfrac14doi3A1079102FDVN2FB8CWWD

Detecting patterns in North Korean military provocations 11

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

2 Results

21 Training data results

Figure 4 shows the model we obtained from the training dataset usinga supervised machine-learning The key terms distinguishing NorthKorean threats from a peaceful period were lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquolsquoJunersquo and lsquoJapanesersquo When certain conditions were satisfied (as ex-plained below) the appearance of these terms in KCNA articles couldplay the role of a pattern-indicator that a North Korean military threatwas on the horizon

According to our model the first and strongest indicator of a NorthKorean military provocation is the appearance of the term lsquoyearsrsquo Inthe training dataset all KCNA articles in which the word lsquoyearsrsquo ap-peared at least once (53 articles in total) were published within a weekof a North Korean military strike From our viewpoint the most reli-able sign of an imminent North Korean military provocation is a sud-den increase in the frequency of the term lsquoyearsrsquo in KCNA articles

When the term lsquoyearsrsquo is absent the second best indicator of a NorthKorean military attack is the word lsquosignedrsquo Approximately 85 ofKCNA articles (51 articles in total) that had the term lsquosignedrsquo withoutthe word lsquobakrsquo (from former South Korean president Lee Myung-bak)

Figure 4 First conditional inference tree

12 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

were published within a week of a North Korea military provocationAs a result a sudden spike in the use of the word lsquosignedrsquo (withoutlsquobakrsquo) indicates that a North Korean military provocation is around thecorner In this respect however there is a further condition that mustbe satisfied When lsquosignedrsquo and lsquobakrsquo appear together in the same arti-cle they indicate a pattern of a peaceful situation instead As a resultthe pattern-detecting role of the term lsquosignedrsquo appears to be contingentupon the presence or absence of another term lsquobakrsquo

According to our model the third index indicating that Pyongyangmay be preparing for a military operation is the term lsquoassemblyrsquo Infact all the KCNA articles that included the word lsquoassemblyrsquo withoutlsquoyearsrsquo or lsquosignedrsquo (22 articles in the training dataset) were publishedwithin one week of a North Korean military attack As a result a sud-den increase in the frequency of the word lsquoassemblyrsquo may be a goodpattern indicator that a North Korean military strike is on the horizoneven in the absence of stronger threat terms such as lsquoyearsrsquo andlsquosignedrsquo

The fourth indicator of a North Korean military threat is the termlsquoJunersquo Unlike other indicators however the role of lsquoJunersquo is condi-tional In the absence of other stronger signs (ie years signed and as-sembly) the term lsquoJunersquo can play the role of a significant indicator of aNorth Korean military attack only if another term lsquoreunificationrsquo ap-pears once or less in the same article To our surprise however theterm lsquoJunersquo turns into a strong indicator of a non-threat situation if theword lsquoreunificationrsquo appears more than twice in the same article As aresult it seems that the word lsquoJunersquo implies an increasing NorthKorean threat only if it is not closely associated with another key pat-tern indicator lsquoreunificationrsquo

The last index of a North Korean military provocation is the wordlsquoJapanesersquo When other (and stronger) terms such as lsquoyearsrsquo lsquosignedrsquolsquoassemblyrsquo and lsquoJunersquo were not present all the KCNA articles that in-cluded the term lsquoJapanesersquo (17 articles in the training dataset) appearedwithin one week of a North Korean military provocation As a resultthe use of the term lsquoJapanesersquo serves as a good pattern indication thatin the absence of other attack words Pyongyang is moving dangerouslyclose to a military strike

Detecting patterns in North Korean military provocations 13

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

22 Testing the model

There are 904 KCNA articles that do not include any of the indicatorsdiscussed above Because they do not contain any pattern indicator ofa North Korean attack our model identifies them as non-threats Inreality however about 20 of these articles were published within oneweek of the five military provocations by Pyongyang As a result ourmodel has roughly 80 accuracy in identifying North Korean militarythreats What is encouraging is that our model has identified five keypattern indicators of increasing North Korean threats When these keyterms are put together under certain conditions they correctly identify80 of articles in the training dataset as threat articles or non-threatitems Put differently the model can accurately classify in 8 of 10 caseswhether various messages from Pyongyang are real threats or justrhetoric

Although 80 accuracy is impressive it is too early to be optimisticAfter all 80 accuracy was achieved with the training dataset fromwhich our model was originally derived If we were impressed by its80 accuracy rate we would resemble a case study specialist who de-veloped an elaborate theory from a few cases and then applied it backto those original cases only to be impressed by how accurate hishertheory was The real test consists of applying our model to cases otherthan those from which it is derived This is the reason why we dividedthe entire KCNA data into two subsets the training dataset fromwhich our model is developed and the test dataset to which it will beapplied as a real test

Table 1 shows the result of our model against the test datasetNumbers in the cells indicate whether there was agreement or discrep-ancy between model predictions and actual cases For example our

Table 1 Model accuracy against test dataset

ActualModel Threat Non-threat

Threat 74 13 Positive predictive value 085

Non-threat 73 327 Negative predictive value 082

Sensitivity Specificity Overall accuracy

050 096 082

14 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

model classified 327 cases (the lower right cell) as non-threat and thosearticles were in fact published during non-threat weeks In additionthe model classified 74 articles (the upper left cell) as threats and thosearticles were actually published during threat weeks When we applyour model to the test dataset its overall accuracy turns out 823roughly the same level we obtained with the training dataset Althoughour model is deduced from the training dataset it does not lose cate-gorizing capacity when applied to the test dataset Specifically of 487articles in the test dataset our model correctly classified 401 articles(823) as either threats or non-threats The model is very effective inidentifying harmless rhetoric from Pyongyang that is messages not re-sulting in actual military attacks When applied to a total of 340 non-threat articles in the test dataset our model successfully classified 327items as noise with only 13 misses (ie 967 accuracy) As a result itis very effective at filtering noise from Pyongyang By contrast thepattern-detecting accuracy of our model drops somewhat with respectto threats Of 147 threat articles in the test dataset it correctly identi-fies 74 items as threats while misrecognizing 73 threats as peaceful arti-cles (ie 50 accuracy)

3 Discussion

What does the model tell us in plain terms In particular what are themeanings of the key indicators for a North Korean military threat (ieyears signed assembly June and Japanese) To answer it is necessaryto go back to the original KCNA documents in order to understandthe contexts in which these terms were used A close reading of theKCNA articles that included the five key indicators or lsquoattack wordsrsquoreveals several interesting patterns

First the North Korean regime tends to emphasize its history ofmilitary struggle against foreign enemies before it launches armed prov-ocations It is then understandable that key terms such as lsquoyearsrsquo andlsquoJapanesersquo often appear in KCNA articles immediately before a militarystrike Prior to the second naval clash of Yonpyong on 29 June 2002for instance the North Korean government published a series of arti-cles in which it boasted of the lsquoyearsrsquo of North Korearsquos victoriousstruggles against foreign enemies including lsquoJapanesersquo colonialism(1910ndash45) It is possible that Pyongyang invoked the military legacy of

Detecting patterns in North Korean military provocations 15

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

its supreme leader Kim Il-sung in anticipation of an impending con-flict either to boost the morale of its people or to show the outsideworld that it was determined not to back down

Second the most perplexing term in Figure 4 is lsquoJunersquo because itcan be indicative of both threat and peace Specifically the word lsquoJunersquois an indicator of peace when associated with lsquoreunificationrsquo but it canalso be a sign of a military threat when it does not appear in conjunc-tion with lsquoreunificationrsquo A careful reading of the KCNA articles thatinclude the term lsquoJunersquo suggests an explanation for this strange phe-nomenon It turns out that the word lsquoJunersquo can refer to either theKorean War (which broke out on 25 June 1950) or the historic summitbetween Kim Dae-jung and Kim Il-sung on 15 June 2000 (which waspraised as a major cornerstone for future lsquoreunificationrsquo in the officialNorth Korean press) As a result when the word lsquoJunersquo appears inclose association with lsquoreunificationrsquo it refers to the 2000 summit (orlsquothe June 15 Summitrsquo as it is called in North Korea) thus indicatingthe peaceful intentions of Pyongyang By contrast when the wordlsquoJunersquo appears alone without connection to lsquoreunificationrsquo it refers tothe Korean War thus conveying a more hostile mood

Third it is also interesting to note that the North Korean govern-ment tends to publish many lsquosignedrsquo commentaries from RodongSinmun in the KCNA immediately before it launches military provoca-tions In these lsquosignedrsquo commentaries of Rodong Sinmun Pyongyangusually criticizes foreign countries (especially the United States Japanand South Korea) in very harsh terms on various issues Accordinglythe third term lsquosignedrsquo seems to correspond to the fact that RodongSinmun the official mouthpiece of the North Korean regime barksvery loudly before it actually bites its intended target A sudden in-crease in the number of lsquosignedrsquo commentaries of Rodong Sinmun inKCNA articles appears to be a reliable sign that the North Koreangovernment may be turning to an attack mode

Finally the last indicator of an imminent threat lsquoassemblyrsquo is theeasiest term to identify but the hardest one to interpret As shown inFigure 5 the term lsquoassemblyrsquo is primarily used in reference to the SPABecause the SPA is the highest North Korean authority with regard toits relations with foreign countries it is tempting to interpret a suddenincrease in references to the SPA prior to military provocations as indi-cating that Pyongyang is attempting to openly signal its hostile

16 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

intentions to the outside world In other words belligerent messagesemanating from the SPA (and reported in KCNA articles) may revealthe determination of the North Korean regime

Although plausible the problem with such an interpretation is thata close reading of several KCNA articles using the term lsquoassemblyrsquoshows that messages from the SPA are often not dire at all For exam-ple consider the following article published on 17 November 2010 justdays before the North Korean shelling at Yonpyong Island lsquoKim YongNam President of the Presidium of the DPRK SPA sent a message ofgreetings to Qaboos Bin Said Sultan of Oman on Wednesday on theoccasion of its national holidayrsquo (KCNA 17 November 2010) As thisexample illustrates a typical KCNA article with the word lsquoassemblyrsquodescribes rather routine business of the SPA such as how it sent a mes-sage of congratulations to a foreign country how it greeted visitingdignitaries from foreign countries and so on Clearly such mundanemessages cannot be a meaningful sign of an imminent North Koreanmilitary provocation While the publication of these articles (containingthe term lsquoassemblyrsquo) might possibly correspond to an uptick inPyongyangrsquos diplomatic efforts to strengthen its ties with foreign coun-tries and increase international support prior to a military attack weneed further analyses to substantiate such a hypothesis At the sametime it also seems clear from the test that the term lsquoassemblyrsquo does not

Figure 5 Social network analysis for key words in KCNA articles

Detecting patterns in North Korean military provocations 17

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

appear randomly and is somehow linked to the timing of NorthKorean military provocations Further research is necessary to make aproper interpretation of this seemingly irrelevant yet apparently signifi-cant indicator of North Korean military threats

31 Robustness checks

Before we conclude it is necessary to address five issues regarding therobustness of our findings (i) the rare-event issue (that is due to therare nature of a North Korean military attack our sampling ratio of710 should be justified) (ii) an out-of-sample alternative (that is in-stead of building our model by using 70 of data from the five NorthKorean military strikes and then applying it to the remaining 30 ofthe data it may be better to develop a model from the first threestrikes and then apply it to the remaining two North Korean provoca-tions) and (iii) a North Korean leadership change effect (that is in-stead of elaborating a single model covering 1997 to 2013 it may makesense to develop two separate models [one for the Kim Jong-il periodand the other for the Kim Jong-un era] in order to check whether thereis a leadership change effect) (iv) a South Korean leadership changeeffect (that is it may be better to elaborate two different models [onefor the lsquoconservativersquo period during Lee Myung-bak and Park Geun-hye and the other for the lsquoprogressiversquo era during Kim Dae-jung andRoh Moo-hyun] in order to investigate whether North Korea re-sponded differently to leadership changes in South Korea and (v) ano-Cheonan alternative (that is it is worth elaborating a model whileexcluding the Cheonan naval ship case because unlike the remainingfour attacks Pyongyang has denied its involvement in sinking theCheonan In this section these five issues are addressed to check therobustness of our model

First there is a discrepancy in the ratio of threat days vs non-threatdays between the data used in our analysis and the actual frequencyAs explained earlier we have defined the threat articles as those pub-lished one week prior to each actual crisis whereas the non-threatitems are defined as those chosen randomly for a 10-day period at leasttwo months before or after actual provocations For all five crises theratio of our data is then 3550 (in days) or 710 In reality however thenumber of peace days (ie 18 years minus 35 days) is much larger than

18 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

the number of crisis days (35 days) When we count all of the peacedays that are not included in our data (ie when we take all TrueNegatives into our data) the North Korean provocations become ex-tremely rare events As a result a possible criticism is that our ap-proach does not represent the true data generating process in a sam-pling procedure While military provocations occur rarely in realityour sampling procedure exaggerates their likelihood by significantly re-ducing the number of peace days in the data Put more technically ourmodel reduces the number of False Positives while increasing the num-ber of False Negatives

Despite the criticism our sampling strategy can be justified for tworeasons First the rare-event problem is not new in political scienceFor example King and Zeng (2001ab) addressed the same problem instatistical analysis (eg logit analysis with rare events) A commonlysuggested solution is the lsquochoice-based or endogenous stratified sam-plingrsquo in econometrics and the lsquocase-control designrsquo in epidemiology(Breslow 1996) The idea is to lsquoselect within categories of Yrsquo such thatall crisis days are sampled while we use a small random fraction ofnon-crisis days In this respect our sampling strategy is consistent withsuch an approach in that we have chosen all crisis days (35) and a ran-dom fraction of peace days (50) Not surprisingly such an approach iscommonly used in supervised machine-learning as well The rare-eventissue also known as a class imbalance problem is an ongoing issue insupervised machine-learning because the size of one class is often muchlarger than that of the other class (eg premature births violent civilconflicts fraudulent credit card transactions etc) In supervisedmachine-learning two solutions have been suggested a data samplingtechnique and an algorithmic modification technique Whereas datasampling is to under-sample the majority class while over-sampling theminority class a modification technique is to combine both over- andunder-sampling methods at the same time In this article we haveadopted a data sampling technique in order to address the class imbal-ance between threat days (a minority class) and non-threat days (a ma-jority class) in North Korean military provocations

Second although reducing the number of non-crisis days in ourdata (under-sampling) while using all the crisis days (over-sampling) isconsistent with both statistical and machine-learning literature therestill remains a question How far should we reduce the number of

Detecting patterns in North Korean military provocations 19

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

peace days Will a ratio of 12 13 or an even more skewed ratio gen-erate a better performance than our model using a 710 ratio To an-swer this question we pushed the ratio until we saw a clear sign of themodel becoming too skewed towards the majority class (peace days)In our case it occurred when the ratio between the two classes ex-ceeded 13 On this subject literature on machine-learning clearlyshows that the ratio of crisis and non-crisis should not be significantlyunbalanced According to Ertekin (2013) lsquo[i]n problems where theprevalence of classes is imbalanced it is necessary to prevent the resul-tant model from being skewed towards the majority (negative) classand to ensure that the model is capable of reflecting the true nature ofthe minority (positive) classrsquo Otherwise the generated model can sufferfrom an over-fitting problem For instance in the face of an extremelyskewed dataset (eg the real ratio of our data would be 35 crisis daysvs 6535 peace days) an automated machine-learning process naturallybecomes lsquogreedyrsquo that is in order to increase its overall accuracy it in-creasingly focuses on the larger category (6535 days) while virtually ig-noring the smaller category (35 days) Generating the model in thisway would be problematic however because our object is to focus onthe minority category of crisis days which is less than 05 of all daysfrom 1997 to 2013 After all we are trying to classify upcoming NorthKorean attacks not peaceful days

With this goal in mind we ran robustness checks to see if alternativemodels yield a better explanatory power when we increased the numberof peace days vis-a-vis threat days Because the maximum limit of anunbalanced ratio for automated machine-learning is 13 we attemptedboth 12 and 13 in our tests As Table 2 shows it is clear that the orig-inal model outperforms both alternatives Compared to the 12 ratiothe original model has more explanatory power in every way (iehigher overall accuracy higher threat accuracy higher non-threat accu-racy) By contrast the 13 ratio model has a slightly lower overall accu-racy marginally higher non-threat accuracy but devastatingly lowerthreat accuracy (175) than our original model because the increasingimbalance (from 710 to 13) turned the automated machine-learningprocess to a lsquogreedyrsquo mode As a result it is our conclusion that theoriginal model uses the golden ratio with the most accurate categoriz-ing capacity

20 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Second it is also necessary to check whether an out-of-sample ap-proach is a better alternative to our original model In our analysis wedeveloped a model by using 70 of data from the five North Koreanattacks and then applied it to the remaining 30 of data from Onemay suspect however that a better approach is to elaborate a modelfrom the first three North Korean strikes (ie during the Kim Jong-ilperiod) and then apply it to the remaining two attacks (ie during theKim Jong-un era) in order to see whether it yields more explanatorypower As Table 2 shows the opposite turns out to be the case In factthe out-of-sampling model loses explanatory power in every possibleway (ie lower overall accuracy lower threat accuracy and lower non-threat accuracy) As a result our original model outperforms the out-of-sampling alternative

Third it is necessary to check whether or not there is a leadershipchange effect in North Korea Has the change of power from KimJong-il to Kim Jong-un produced such significant differences in theirleadership style that it may be worth developing two separate models(one for the Kim Jong-il period the other for the Kim Jong-un era)instead of a single model covering the entire period as we did As

Table 2 Comparison with alternative models

ModelsOverall accuracy (N)

Threat accuracy(Sensitivity)

Non-threataccuracy (Specificity)

Rare event I 724 328 823

12 Ratio (368508) (66201) (302367)

Rare event II 788 175 99

13 Ratio (656832) (36206) (620626)

Out of sampling 541 145 821

KJI KJU (6471195) (72495) (575700)

NK leadership KJI 685 (98143) 459 (2861) 793 (6582)

KJI vs KJU KJU 595 (195328) 612 (93152) 58 (102176)

SK leadership Con 614 (221360) 363 (58160) 815 (163200)

Con vs Prog Prog 681 (98144) 493 (3571) 863 (6373)

No cheonan 669 514 787

Case (273408) 92178 181230

Our Model 823 503 961

(401487) (74147) (327340)

Detecting patterns in North Korean military provocations 21

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Table 2 shows the double platform alternative yields a worse outcomeoverall In fact its only advantage is that the Kim-Jong-un-model hasslightly higher threat accuracy (612) than our original model(503) In every other aspect however the Kim-Jong-un-model per-forms much worse Moreover the Kim-Jong-il-model has a much loweraccuracy in all respects than our original model As a result it is clearthat the double platform is a worse alternative to our single platformThe recent leadership change in Pyongyang has shown little effects asfar as its military provocations are concerned

Fourth it is important to investigate whether a leadership change inSouth Korea has any impact on North Korean military provocationsHas the oscillation of power between the lsquoprogressiversquo administrations(Kim Dae-jung and Roh Moo-hyun) and the lsquoconservativersquo regimes(Lee Myung-bak and Park Geun-hye) invoked different responses fromPyongyang If so we should expect a double platform (one model forthe conservative period vs the other model for the progressive era inSouth Korea) to outperform the original single platform As Table 2shows however the two separate models in the double-platform alter-native perform much worse than the original single platform in all cat-egories an overall accuracy threat accuracy and non-threat accuracyClearly the original model is a better choice It is shown above thatthe leadership change in Pyongyang has not produced significant policyshifts in its military provocations Likewise leadership changes inSouth Korea do not seem to have invoked major policy shifts inPyongyang as far as its military provocations are concerned

Finally it is worth examining an alternative model that excludes thesinking of the Cheonan naval ship case Unlike the remaining four at-tacks in our dataset the North Korean government has consistentlydenied its involvement in the Cheonan incident If Pyongyang had notreally sunk the Cheonan as it claimed it means that our original modelis performing below its full potential because it is built upon some er-roneous cases (ie the Cheonan incident) In that case we should ex-pect the no-Cheonan alternative to outperform our original model AsTable 2 shows however when we exclude all the data related to theCheonan case the resulting model loses much explanatory power com-pared to the original model Only in one category (a threat accuracy)the no-Cheonan-case alternative outperforms our original model buteven in that category the difference is ignorable (503 vs 514) In

22 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

all other categories the original model shows much stronger perfor-mances 823 vs 669 in an overall accuracy and 961 vs 787 ina non-threat accuracy Although Pyongyang has denied its involvementin the Cheonan incident the huge performance gap between the origi-nal model and the no-Cheonan alternative suggests otherwise

A different way of comparing performances of alternative models isshown in Figure 6 where the Receiver Operating Characteristic (ROC)is presented A ROC space is defined by a False Positive Rate (1 ndashSpecificity) in the x-axis and a True Positive Rate (Sensitivity) in the y-axis In a ROC space a theoretically perfect model (ie a combinationof 100 True Positive Rate with 0 False Positive Rate) appears in theupper left corner with the coordinates (0 1) In comparison a purelyrandom model such as a coin toss is found along a 45-degree diagonalline As a result good models (ie those performing better than ran-dom guesses) are found above the 45 degree line (especially close to they-axis) whereas bad models are located below it In Figure 6 the

Figure 6 Receiver operating characteristic (ROC) plot

Detecting patterns in North Korean military provocations 23

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

coordinates of our model (004 05) means that it has a False PositiveRate of 4 while its True Positive Rate is 50 By contrast the coordi-nates of alternative models in the ROC space are as follows (i) theRare Event I model with 12 ratio is (018 033) (ii) the Rare Event IImodel with 13 ratio is (001 018) (iii) the Out of Sampling (KJI KJU) model is (018 015) (iv) the Leadership Change (KJI-only)model is (021 046) (v) the Leadership Change (KJU-only) model is(042 061) (vi) the model for Conservative South Korean leadershipis (018 036) (vii) the model for Conservative South Korean leader-ship is (014 049) and (viii) the model that excludes the Cheonan caseis (021 051) As one can see all eight alternatives yield their coordi-nates either close to or lower than the diagonal line in the ROC spacewhereas our model lies above the diagonal line and close to the y-axisin the ROC space As a result we can conclude that the original modelperforms better than its potential rivals

4 Conclusion

In this article we studied articles published by the KCNA the officialnews outlet of North Korea in order to analyze patterns of its conven-tional military provocations To this end we have adopted a newmethod of automated text classification through supervised machinelearning Our model investigated the frequency of all terms appearingin KCNA articles immediately prior to five North Korean military at-tacks between 1997 and 2013 The frequency of these terms was thencompared with the frequency of terms appearing in KCNA articlespublished during peacetime without military provocations The com-parison brought to light a number of key terms ndash lsquoattack wordsrsquo so tospeak ndash whose appearances spiked in the KCNA prior to NorthKorean attacks Based on these terms our model correctly identifieseight of 10 articles as signs of imminent attacks or as peacetime newspieces

Specifically our model found five pattern-detecting terms of NorthKorean military threats lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquo lsquoJunersquo andlsquoJapanesersquo For a proper analysis of their meaning we went back to thearticles in which these five attack words appeared in order to investi-gate the contexts in which they were used by the North Korean gov-ernment Our investigation shows that in the lead-up to an attack

24 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Pyongyang displayed a strong tendency to emphasize the legacy of itsmilitary struggle against lsquoJapanesersquo colonialism and its fight against theUS imperialism during the Korean War (which began in lsquoJunersquo 1950)perhaps in an effort to heighten a domestic patriotic fervor at a timeof impending crisis In addition immediately before Pyongyang em-barks on hostile provocations the KCNA tends to increase its reprint-ing of lsquosignedrsquo commentaries from Rodong Sinmun which typically crit-icizes the United States South Korea and Japan in harsh termsprobably reflecting its deteriorating relations with the outside world

It seems that our methodology is promising to studies of interna-tional conflicts of various sorts First as shown in this article amachine-learning technology is useful in terms of detecting patternsthat distinguish threats of Pyongyang from its lsquonoisesrsquo or lsquobluffingrsquoSecond for other authoritarian regimes where there is a tight govern-ment controls over the mass media our approach can be used to detectpatterns symptoms or clues to escalating crisis Finally even for dem-ocratic countries with a free media our approach can be used on vari-ous occasions For instance a machine-learning technology can be uti-lized to analyze certain aspects of domestic politics such as signalingor communication between policymakers and the public To cite a con-crete example we can use a machine-learning technique to examinehow an aggressive foreign policy like lsquosabre rattlingrsquo affects public per-ceptions of fear or crisis in democracy Also a machine-learning ap-proach can be used to analyze external interactions of a democratic re-gime with other countries For instance we can test if there are anysignificant correlations between presidential elections in the US and thelevel of North Korean threats or between North Korean nuclear testsand varying public reactions in South Korea

As a final note it is not our contention that the model developed inthis article based on an automated machine-learning technique has de-tected intentional signals which Pyongyang sends to the outside worldimmediately prior to its military attack Instead the key attack wordswe have identified should be seen as signs or patterns that the NorthKorean regime unwittingly displays when it is inching toward a militaryoption For two reasons we doubt an intentional or signaling natureof the attack words in our model First if Pyongyang has indeed usedthe five attack words as a deliberate signal to the outside world that itis gearing up for military provocations why would it send the signal in

Detecting patterns in North Korean military provocations 25

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

such an arcane way that can be detected only through a complicatedautomated text classification process If the North Korean regime in-tends to send a signal to the outside world there is a better clearerand more visible option such as an Official Announcement by theMinistry of Foreign Affairs which Pyongyang has occasionally pub-lished in the pages of the KCNA on important issues Second if it hadbeen sending intentional signals of an impending military strike whywould the North Korean regime deny such an attack afterwards If re-peated an ex post denial would only reduce the credibility of an exante signal creating an image of North Korea as a casual bluffer Fortheir arcane nature and occasional ex post denial the five attack wordsin our model should not be treated as a premeditated signal fromPyongyang Instead they should be understood as inadvertent signs orpatterns unconsciously displayed by the North Korean governmentprior to a military attack In fact it is the unplanned nature of theseattack words that provides even more valuable information to the out-side world because it eliminates the possibility of feigned or false sig-nals from North Korea Like a pitcher who unknowingly flinches be-fore he throws a fast ball Pyongyang may unwittingly display certainpatterns before it launches a military strike

Acknowledgements

The publication of this article is supported by the National ResearchFoundation of Korea Grant funded by the Korean Government (NRF-2014S1A5A2A03065042 and NRF-2013S1A3A2055081) Previous ver-sions of this article were presented at the 2014 International StudiesAssociation Annual Convention (Toronto Canada) For questions re-garding the article please contact H Joo

References

Breslow NE (1196) lsquoStatistics in epidemiology the case-control studyrsquoJournal of American Statistical Association 91 433 14ndash28

Cha V (2010) The Sinking of the Cheonan Center for Strategic ampInternational Studies (2010 April 22) httpcsisorgpublicationsinking-cheonan (24 July 2014 date last accessed)

26 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Choe S-H (2009) Korean Navies Skirmish in DisputedWaters New YorkTimes (2009 November 10) httpwwwnytimescom20091111worldasia11koreahtml_rfrac140 (23 June 2014 date last accessed)

Ertekin S (2013) lsquoAdaptive oversampling for imbalanced data classificationrsquoin G Erol amp L Ricard (ed) Information Sciences and Systems pp261ndash269 New York Springer

Global Security (2002) The naval clash on the yellow sea on 29 June 2002 be-tween South and North Korea the situation and ROKrsquos position (2002July 1) Global Security httpwwwglobalsecurityorgwmdlibrarynewsdprk2002dprk-020701-1htm (22 July 2014 date last accessed)

Hong S (2012) lsquoNorth Korearsquos capability to conduct provocations and ROK-US capability to counter themrsquo Korea Association of Defense IndustryStudies 19 2 135ndash136

Hopkins D and King G (2010) lsquoExtracting systemic social science meaningfrom textrsquo American Journal of Political Science 54 1 229ndash47

Jung S-C (2013) Internal Instability and External Provocation Seoul KINU

Jung S-Y (2008) lsquoA study on the origin of the pueblo incident on 1968rsquoKukjejongchiyonrsquogu 11 2 179ndash207

Jurafsky D and Martin J (2009) Speech and Natural Language ProcessingAn Introduction to Natural Language Processing Computational Linguisticsand Speech Recognition NJ Prentice Hall

Kang C-G (2013) lsquoA laboratory for recursive partyrsquo Hangukcontentshakhoe11 4 23ndash32

King G and Zeng L (2001a) lsquoLogistic regression in rare events datarsquoPolitical Analysis 9 2 137ndash163

King G and Zeng L (2001b) lsquoExplaining rare events in InternationalRelationsrsquo International Organization 55 3 693ndash715

Ko M-K (2015) lsquoNorth Korean military adventurism in the late 1960s andthe changes of party-military relationsrsquo Hyondaibukhanyongu 18 3 7ndash58

Lee W-K (2014) lsquoA case study on the provocations by NKrsquo Kunsaji 91 663ndash110

Litwak R (2007) Regime Change Washington DC Johns HopkinsUniversity Press

Macfie N (2013) The Battles of the Korean West Sea (2010 November 29)Reutershttpwwwreuterscomarticle20101129us-korea-north-clashes-idUSTRE6AS1AL20101129 (5 August 2014 date last accessed)

Mearsheimer J J (2001) The Tragedy of Great Power Politics New YorkWW Norton amp Company

Moore M and Hutchison P (2010) Yeonpyeong Island A History(2010November 23) The Telegraph httpwwwtelegraphcouknewsworldnews

Detecting patterns in North Korean military provocations 27

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

asiasouthkorea8155486Yeonpyeong-Island-A-historyhtml (7 August 2014date last accessed)

Oh I-W (2011) lsquoSouth Korearsquos countermeasures against North KorearsquosArmed Provocation and Offensive dialogue proposalrsquo Trsquoongiljyollyak 111 227ndash266

Rich T (2012a) lsquoLike father like son Correlates of leaderhip in NorthKorearsquos english language newsrsquo Korea Observer 43 4 649ndash674

Rich T (2012b) lsquoDeciphering North Korearsquos nuclear rhetoric an automatedcontent analysis of KCNA newsrsquo Asian Affairs An American Review 3973ndash89

Sigal L (1998) Disarming Strangers Nuclear Diplomacy with North KoreaPrinceton Princeton University Press

Sohn J-Y (2002) South North Korea clash at sea CNN (2002 June 29)httpeditioncnncom2002WORLDasiapcfeast0629koreawarships (21July 2014 date last accessed)

Sudworth J (2010) How South Korean Ship was Sunk BBC News (2010May 20) httpwwwbbccouknews10130909 (4 August 2014 date lastaccessed)

28 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

  • lcw016-FN1
  • lcw016-FN2
  • lcw016-FN3
  • lcw016-FN4
Page 9: Detecting patterns in North Korean military provocations ...idse.or.kr/file/Whang_1.pdf · that North Korea may have because of increasing power asymmetry since the collapse of the

Not surprisingly many scholars have attempted to analyze NorthKorean military provocations trying to identify their lsquocausesrsquo (Jung2013 Lee 2014 Ko 2015) main lsquogoalsrsquo in those attacks (Jung 2008Kang 2013) proper lsquopoliciesrsquo to curtail further threats (Oh 2011Kim 2012) and so on Despite sincere efforts earlier studies sufferedfrom one notorious problem a lack of reliable data As one put itNorth Korea has been lsquothe blackest of black holesrsquo (Litwak 2007289) Under such circumstances scholars had little choice but to makeeducated guesses ndash lsquoguesstimationsrsquo ndash regarding unknown intentionsgoals or likely moves of the North Korean regime As a result the ex-isting literature on North Korean military provocations has beendriven less by reliable data but more by subjective interpretations

By contrast we rely on the official North Korean media KCNA Asfor the method we use a supervised machine-learning technology tomaximize the use of the KCNA dataset that covers various aspects ofNorth Korea from 1997 to 2013 The main advantage of our methodcomes from the quality of the data that is our findings are data-driven and thus more objective than previous works relying on subjec-tive interpretation of selected observations Given the paucity ofreliable information on North Korea it is important to analyze thecontent of official KCNA articles that are publicly available A super-vised machine-learning technology is the optimal method to processsuch data objectively

Supervised machine-learning consists of three steps(i) data collec-tion and document labelling (ii) pre-processing and (iii) model extrac-tion and analysis2 First we obtained our dataset from the KCNAwebsite (httpwwwkcnacojp) Since there are five North Korean mili-tary attacks in our case we pulled 10 tranches of articles from all theKCNA articles published from 1997 to 2013 We then labelled five ofthese tranches lsquothreatrsquo while labelling the other five as lsquonon-threatrsquo All

2 We tried a variety of machine learning algorithms such as Random Forests SupportVector Machines and Conditional Inference Trees Also we cross-validated their resultsOur findings demonstrate that these algorithms produced similar results Moreover all ofthese algorithms identified the similar pattern-detecting terms for classifying the KCNA ar-ticles In this article we reported the results based on the Conditional Inference Tree algo-rithm which ran the tree-structured regression models through binary recursivepartitioning in a conditional inference framework For more details on the ConditionalInference Tree algorithm please see the R Package lsquoPartyrsquo for Conditional Inference Trees(lsquoctreersquo) at httpcranr-projectorgwebpackagespartypartypdf

Detecting patterns in North Korean military provocations 9

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

the articles published within a week preceding each attack is labelled aslsquothreatrsquo and the rest of KCNA articles were treated as lsquonon-threatrsquo Intotal there were 1624 KCNA articles published during this periodwith 487 labelled as lsquothreatrsquo and 1137 labelled as lsquonon-threatrsquoFor the threat articles we extracted all the articles published within aweek of each North Korean military provocation (Fig 2) For the non-threat articles we selected tranches of news articles for a randomlychosen 10-day period from at least two months before or after a NorthKorean attack In so doing our goal was to capture articles that wereunrelated to a North Korean attack thus labelled as lsquonon-threatrsquo

Second the selected dataset is then lsquopreprocessedrsquo Since the originalarticles are not ready for an automated text analysis a cleaning processcalled lsquopreprocessingrsquo is necessary to prepare them for machine-learning Pre-processing is the standard procedure before the applica-tion of an automated text analysis Following the standard preprocess-ing procedure we remove all numbers and stopwords (eg a the into etc)3 We then transformed the body of the text into a documentterm matrix that was composed of rows of news articles followed bycolumns of terms The document term matrix turned preprocessed texts

Non-Threat

-2 +2

Acl

All Threacollected eading up

All Nowere ptwo mo

Threat

Provocation

at-labeledfrom the p to a pro

on-Threatpulled twoonths afte

d articles wseven day

ovocation

t-labeled o months er a provo

ere ays

articles before or

ocation

Non-

r

Threat

Time

Figure 2 Labelling threat and non-threat articles

3 In this article we use one of the most popular preprocessing methods known as the lsquobag ofwordsrsquo concept (Jurafsky and Martin 2009) It discards the use of word order as a factorand removes the so-called lsquostopwordsrsquo from the data The stopwords include punctuationcapitalization common words (eg at an the etc) and numbers For a full list of thestopwords used in our research please visit httpjmlrcsailmitedupapersvolume5lewis04aa11-smart-stop-listenglishstop Although preprocessing reduces the amount of infor-mation it has been shown by researchers that a simplification of text via preprocessing issufficient to infer valuable models (Hopkins and King 2010)

10 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

into quantifiable values (eg term counts and frequencies) for each cor-responding article

Finally once we had collected and preprocessed the KCNA data wesplit the whole data into two subsets a training dataset and a testdataset From the complete set of articles from both threat and non-threat tranches (1624 articles in total) we randomly selected 70(1137 articles) for the training dataset (Fig 3) The labels lsquothreatrsquo orlsquonon-threatrsquo for each article were included in the training dataset forthe purpose of automated machine-learning that is to develop amodel that can select pattern-detecting features based on a priori clas-sifications Basically the key pattern was frequency of occurrence Thefrequency with which certain words and short phrases appeared inthreat or non-threat articles determined how they were selected andweighted as pattern-detecting features in the model The trainedmodel was then applied to the remaining test dataset which was usedsolely for testing the accuracy of our model Importantly we removedall the threat and non-threat labels from each article in the testdataset The purpose for doing so was to challenge our model to usewhat it had learned (from the training dataset) about significant fea-tures (ie a term frequency rate) to accurately classify KCNA articles(in the test dataset) as either a threat or a non-threat without relyingon labels4

All Data 1624 Articles

1137 Articles 487

Training DataThreat Labels Provided

Test DataThreat Labels Removed

Figure 3 Training and test data sets

4 For replication our data is available at httpsdataverseharvardedudatasetxhtmlpersistentIdfrac14doi3A1079102FDVN2FB8CWWD

Detecting patterns in North Korean military provocations 11

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

2 Results

21 Training data results

Figure 4 shows the model we obtained from the training dataset usinga supervised machine-learning The key terms distinguishing NorthKorean threats from a peaceful period were lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquolsquoJunersquo and lsquoJapanesersquo When certain conditions were satisfied (as ex-plained below) the appearance of these terms in KCNA articles couldplay the role of a pattern-indicator that a North Korean military threatwas on the horizon

According to our model the first and strongest indicator of a NorthKorean military provocation is the appearance of the term lsquoyearsrsquo Inthe training dataset all KCNA articles in which the word lsquoyearsrsquo ap-peared at least once (53 articles in total) were published within a weekof a North Korean military strike From our viewpoint the most reli-able sign of an imminent North Korean military provocation is a sud-den increase in the frequency of the term lsquoyearsrsquo in KCNA articles

When the term lsquoyearsrsquo is absent the second best indicator of a NorthKorean military attack is the word lsquosignedrsquo Approximately 85 ofKCNA articles (51 articles in total) that had the term lsquosignedrsquo withoutthe word lsquobakrsquo (from former South Korean president Lee Myung-bak)

Figure 4 First conditional inference tree

12 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

were published within a week of a North Korea military provocationAs a result a sudden spike in the use of the word lsquosignedrsquo (withoutlsquobakrsquo) indicates that a North Korean military provocation is around thecorner In this respect however there is a further condition that mustbe satisfied When lsquosignedrsquo and lsquobakrsquo appear together in the same arti-cle they indicate a pattern of a peaceful situation instead As a resultthe pattern-detecting role of the term lsquosignedrsquo appears to be contingentupon the presence or absence of another term lsquobakrsquo

According to our model the third index indicating that Pyongyangmay be preparing for a military operation is the term lsquoassemblyrsquo Infact all the KCNA articles that included the word lsquoassemblyrsquo withoutlsquoyearsrsquo or lsquosignedrsquo (22 articles in the training dataset) were publishedwithin one week of a North Korean military attack As a result a sud-den increase in the frequency of the word lsquoassemblyrsquo may be a goodpattern indicator that a North Korean military strike is on the horizoneven in the absence of stronger threat terms such as lsquoyearsrsquo andlsquosignedrsquo

The fourth indicator of a North Korean military threat is the termlsquoJunersquo Unlike other indicators however the role of lsquoJunersquo is condi-tional In the absence of other stronger signs (ie years signed and as-sembly) the term lsquoJunersquo can play the role of a significant indicator of aNorth Korean military attack only if another term lsquoreunificationrsquo ap-pears once or less in the same article To our surprise however theterm lsquoJunersquo turns into a strong indicator of a non-threat situation if theword lsquoreunificationrsquo appears more than twice in the same article As aresult it seems that the word lsquoJunersquo implies an increasing NorthKorean threat only if it is not closely associated with another key pat-tern indicator lsquoreunificationrsquo

The last index of a North Korean military provocation is the wordlsquoJapanesersquo When other (and stronger) terms such as lsquoyearsrsquo lsquosignedrsquolsquoassemblyrsquo and lsquoJunersquo were not present all the KCNA articles that in-cluded the term lsquoJapanesersquo (17 articles in the training dataset) appearedwithin one week of a North Korean military provocation As a resultthe use of the term lsquoJapanesersquo serves as a good pattern indication thatin the absence of other attack words Pyongyang is moving dangerouslyclose to a military strike

Detecting patterns in North Korean military provocations 13

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

22 Testing the model

There are 904 KCNA articles that do not include any of the indicatorsdiscussed above Because they do not contain any pattern indicator ofa North Korean attack our model identifies them as non-threats Inreality however about 20 of these articles were published within oneweek of the five military provocations by Pyongyang As a result ourmodel has roughly 80 accuracy in identifying North Korean militarythreats What is encouraging is that our model has identified five keypattern indicators of increasing North Korean threats When these keyterms are put together under certain conditions they correctly identify80 of articles in the training dataset as threat articles or non-threatitems Put differently the model can accurately classify in 8 of 10 caseswhether various messages from Pyongyang are real threats or justrhetoric

Although 80 accuracy is impressive it is too early to be optimisticAfter all 80 accuracy was achieved with the training dataset fromwhich our model was originally derived If we were impressed by its80 accuracy rate we would resemble a case study specialist who de-veloped an elaborate theory from a few cases and then applied it backto those original cases only to be impressed by how accurate hishertheory was The real test consists of applying our model to cases otherthan those from which it is derived This is the reason why we dividedthe entire KCNA data into two subsets the training dataset fromwhich our model is developed and the test dataset to which it will beapplied as a real test

Table 1 shows the result of our model against the test datasetNumbers in the cells indicate whether there was agreement or discrep-ancy between model predictions and actual cases For example our

Table 1 Model accuracy against test dataset

ActualModel Threat Non-threat

Threat 74 13 Positive predictive value 085

Non-threat 73 327 Negative predictive value 082

Sensitivity Specificity Overall accuracy

050 096 082

14 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

model classified 327 cases (the lower right cell) as non-threat and thosearticles were in fact published during non-threat weeks In additionthe model classified 74 articles (the upper left cell) as threats and thosearticles were actually published during threat weeks When we applyour model to the test dataset its overall accuracy turns out 823roughly the same level we obtained with the training dataset Althoughour model is deduced from the training dataset it does not lose cate-gorizing capacity when applied to the test dataset Specifically of 487articles in the test dataset our model correctly classified 401 articles(823) as either threats or non-threats The model is very effective inidentifying harmless rhetoric from Pyongyang that is messages not re-sulting in actual military attacks When applied to a total of 340 non-threat articles in the test dataset our model successfully classified 327items as noise with only 13 misses (ie 967 accuracy) As a result itis very effective at filtering noise from Pyongyang By contrast thepattern-detecting accuracy of our model drops somewhat with respectto threats Of 147 threat articles in the test dataset it correctly identi-fies 74 items as threats while misrecognizing 73 threats as peaceful arti-cles (ie 50 accuracy)

3 Discussion

What does the model tell us in plain terms In particular what are themeanings of the key indicators for a North Korean military threat (ieyears signed assembly June and Japanese) To answer it is necessaryto go back to the original KCNA documents in order to understandthe contexts in which these terms were used A close reading of theKCNA articles that included the five key indicators or lsquoattack wordsrsquoreveals several interesting patterns

First the North Korean regime tends to emphasize its history ofmilitary struggle against foreign enemies before it launches armed prov-ocations It is then understandable that key terms such as lsquoyearsrsquo andlsquoJapanesersquo often appear in KCNA articles immediately before a militarystrike Prior to the second naval clash of Yonpyong on 29 June 2002for instance the North Korean government published a series of arti-cles in which it boasted of the lsquoyearsrsquo of North Korearsquos victoriousstruggles against foreign enemies including lsquoJapanesersquo colonialism(1910ndash45) It is possible that Pyongyang invoked the military legacy of

Detecting patterns in North Korean military provocations 15

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

its supreme leader Kim Il-sung in anticipation of an impending con-flict either to boost the morale of its people or to show the outsideworld that it was determined not to back down

Second the most perplexing term in Figure 4 is lsquoJunersquo because itcan be indicative of both threat and peace Specifically the word lsquoJunersquois an indicator of peace when associated with lsquoreunificationrsquo but it canalso be a sign of a military threat when it does not appear in conjunc-tion with lsquoreunificationrsquo A careful reading of the KCNA articles thatinclude the term lsquoJunersquo suggests an explanation for this strange phe-nomenon It turns out that the word lsquoJunersquo can refer to either theKorean War (which broke out on 25 June 1950) or the historic summitbetween Kim Dae-jung and Kim Il-sung on 15 June 2000 (which waspraised as a major cornerstone for future lsquoreunificationrsquo in the officialNorth Korean press) As a result when the word lsquoJunersquo appears inclose association with lsquoreunificationrsquo it refers to the 2000 summit (orlsquothe June 15 Summitrsquo as it is called in North Korea) thus indicatingthe peaceful intentions of Pyongyang By contrast when the wordlsquoJunersquo appears alone without connection to lsquoreunificationrsquo it refers tothe Korean War thus conveying a more hostile mood

Third it is also interesting to note that the North Korean govern-ment tends to publish many lsquosignedrsquo commentaries from RodongSinmun in the KCNA immediately before it launches military provoca-tions In these lsquosignedrsquo commentaries of Rodong Sinmun Pyongyangusually criticizes foreign countries (especially the United States Japanand South Korea) in very harsh terms on various issues Accordinglythe third term lsquosignedrsquo seems to correspond to the fact that RodongSinmun the official mouthpiece of the North Korean regime barksvery loudly before it actually bites its intended target A sudden in-crease in the number of lsquosignedrsquo commentaries of Rodong Sinmun inKCNA articles appears to be a reliable sign that the North Koreangovernment may be turning to an attack mode

Finally the last indicator of an imminent threat lsquoassemblyrsquo is theeasiest term to identify but the hardest one to interpret As shown inFigure 5 the term lsquoassemblyrsquo is primarily used in reference to the SPABecause the SPA is the highest North Korean authority with regard toits relations with foreign countries it is tempting to interpret a suddenincrease in references to the SPA prior to military provocations as indi-cating that Pyongyang is attempting to openly signal its hostile

16 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

intentions to the outside world In other words belligerent messagesemanating from the SPA (and reported in KCNA articles) may revealthe determination of the North Korean regime

Although plausible the problem with such an interpretation is thata close reading of several KCNA articles using the term lsquoassemblyrsquoshows that messages from the SPA are often not dire at all For exam-ple consider the following article published on 17 November 2010 justdays before the North Korean shelling at Yonpyong Island lsquoKim YongNam President of the Presidium of the DPRK SPA sent a message ofgreetings to Qaboos Bin Said Sultan of Oman on Wednesday on theoccasion of its national holidayrsquo (KCNA 17 November 2010) As thisexample illustrates a typical KCNA article with the word lsquoassemblyrsquodescribes rather routine business of the SPA such as how it sent a mes-sage of congratulations to a foreign country how it greeted visitingdignitaries from foreign countries and so on Clearly such mundanemessages cannot be a meaningful sign of an imminent North Koreanmilitary provocation While the publication of these articles (containingthe term lsquoassemblyrsquo) might possibly correspond to an uptick inPyongyangrsquos diplomatic efforts to strengthen its ties with foreign coun-tries and increase international support prior to a military attack weneed further analyses to substantiate such a hypothesis At the sametime it also seems clear from the test that the term lsquoassemblyrsquo does not

Figure 5 Social network analysis for key words in KCNA articles

Detecting patterns in North Korean military provocations 17

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

appear randomly and is somehow linked to the timing of NorthKorean military provocations Further research is necessary to make aproper interpretation of this seemingly irrelevant yet apparently signifi-cant indicator of North Korean military threats

31 Robustness checks

Before we conclude it is necessary to address five issues regarding therobustness of our findings (i) the rare-event issue (that is due to therare nature of a North Korean military attack our sampling ratio of710 should be justified) (ii) an out-of-sample alternative (that is in-stead of building our model by using 70 of data from the five NorthKorean military strikes and then applying it to the remaining 30 ofthe data it may be better to develop a model from the first threestrikes and then apply it to the remaining two North Korean provoca-tions) and (iii) a North Korean leadership change effect (that is in-stead of elaborating a single model covering 1997 to 2013 it may makesense to develop two separate models [one for the Kim Jong-il periodand the other for the Kim Jong-un era] in order to check whether thereis a leadership change effect) (iv) a South Korean leadership changeeffect (that is it may be better to elaborate two different models [onefor the lsquoconservativersquo period during Lee Myung-bak and Park Geun-hye and the other for the lsquoprogressiversquo era during Kim Dae-jung andRoh Moo-hyun] in order to investigate whether North Korea re-sponded differently to leadership changes in South Korea and (v) ano-Cheonan alternative (that is it is worth elaborating a model whileexcluding the Cheonan naval ship case because unlike the remainingfour attacks Pyongyang has denied its involvement in sinking theCheonan In this section these five issues are addressed to check therobustness of our model

First there is a discrepancy in the ratio of threat days vs non-threatdays between the data used in our analysis and the actual frequencyAs explained earlier we have defined the threat articles as those pub-lished one week prior to each actual crisis whereas the non-threatitems are defined as those chosen randomly for a 10-day period at leasttwo months before or after actual provocations For all five crises theratio of our data is then 3550 (in days) or 710 In reality however thenumber of peace days (ie 18 years minus 35 days) is much larger than

18 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

the number of crisis days (35 days) When we count all of the peacedays that are not included in our data (ie when we take all TrueNegatives into our data) the North Korean provocations become ex-tremely rare events As a result a possible criticism is that our ap-proach does not represent the true data generating process in a sam-pling procedure While military provocations occur rarely in realityour sampling procedure exaggerates their likelihood by significantly re-ducing the number of peace days in the data Put more technically ourmodel reduces the number of False Positives while increasing the num-ber of False Negatives

Despite the criticism our sampling strategy can be justified for tworeasons First the rare-event problem is not new in political scienceFor example King and Zeng (2001ab) addressed the same problem instatistical analysis (eg logit analysis with rare events) A commonlysuggested solution is the lsquochoice-based or endogenous stratified sam-plingrsquo in econometrics and the lsquocase-control designrsquo in epidemiology(Breslow 1996) The idea is to lsquoselect within categories of Yrsquo such thatall crisis days are sampled while we use a small random fraction ofnon-crisis days In this respect our sampling strategy is consistent withsuch an approach in that we have chosen all crisis days (35) and a ran-dom fraction of peace days (50) Not surprisingly such an approach iscommonly used in supervised machine-learning as well The rare-eventissue also known as a class imbalance problem is an ongoing issue insupervised machine-learning because the size of one class is often muchlarger than that of the other class (eg premature births violent civilconflicts fraudulent credit card transactions etc) In supervisedmachine-learning two solutions have been suggested a data samplingtechnique and an algorithmic modification technique Whereas datasampling is to under-sample the majority class while over-sampling theminority class a modification technique is to combine both over- andunder-sampling methods at the same time In this article we haveadopted a data sampling technique in order to address the class imbal-ance between threat days (a minority class) and non-threat days (a ma-jority class) in North Korean military provocations

Second although reducing the number of non-crisis days in ourdata (under-sampling) while using all the crisis days (over-sampling) isconsistent with both statistical and machine-learning literature therestill remains a question How far should we reduce the number of

Detecting patterns in North Korean military provocations 19

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

peace days Will a ratio of 12 13 or an even more skewed ratio gen-erate a better performance than our model using a 710 ratio To an-swer this question we pushed the ratio until we saw a clear sign of themodel becoming too skewed towards the majority class (peace days)In our case it occurred when the ratio between the two classes ex-ceeded 13 On this subject literature on machine-learning clearlyshows that the ratio of crisis and non-crisis should not be significantlyunbalanced According to Ertekin (2013) lsquo[i]n problems where theprevalence of classes is imbalanced it is necessary to prevent the resul-tant model from being skewed towards the majority (negative) classand to ensure that the model is capable of reflecting the true nature ofthe minority (positive) classrsquo Otherwise the generated model can sufferfrom an over-fitting problem For instance in the face of an extremelyskewed dataset (eg the real ratio of our data would be 35 crisis daysvs 6535 peace days) an automated machine-learning process naturallybecomes lsquogreedyrsquo that is in order to increase its overall accuracy it in-creasingly focuses on the larger category (6535 days) while virtually ig-noring the smaller category (35 days) Generating the model in thisway would be problematic however because our object is to focus onthe minority category of crisis days which is less than 05 of all daysfrom 1997 to 2013 After all we are trying to classify upcoming NorthKorean attacks not peaceful days

With this goal in mind we ran robustness checks to see if alternativemodels yield a better explanatory power when we increased the numberof peace days vis-a-vis threat days Because the maximum limit of anunbalanced ratio for automated machine-learning is 13 we attemptedboth 12 and 13 in our tests As Table 2 shows it is clear that the orig-inal model outperforms both alternatives Compared to the 12 ratiothe original model has more explanatory power in every way (iehigher overall accuracy higher threat accuracy higher non-threat accu-racy) By contrast the 13 ratio model has a slightly lower overall accu-racy marginally higher non-threat accuracy but devastatingly lowerthreat accuracy (175) than our original model because the increasingimbalance (from 710 to 13) turned the automated machine-learningprocess to a lsquogreedyrsquo mode As a result it is our conclusion that theoriginal model uses the golden ratio with the most accurate categoriz-ing capacity

20 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Second it is also necessary to check whether an out-of-sample ap-proach is a better alternative to our original model In our analysis wedeveloped a model by using 70 of data from the five North Koreanattacks and then applied it to the remaining 30 of data from Onemay suspect however that a better approach is to elaborate a modelfrom the first three North Korean strikes (ie during the Kim Jong-ilperiod) and then apply it to the remaining two attacks (ie during theKim Jong-un era) in order to see whether it yields more explanatorypower As Table 2 shows the opposite turns out to be the case In factthe out-of-sampling model loses explanatory power in every possibleway (ie lower overall accuracy lower threat accuracy and lower non-threat accuracy) As a result our original model outperforms the out-of-sampling alternative

Third it is necessary to check whether or not there is a leadershipchange effect in North Korea Has the change of power from KimJong-il to Kim Jong-un produced such significant differences in theirleadership style that it may be worth developing two separate models(one for the Kim Jong-il period the other for the Kim Jong-un era)instead of a single model covering the entire period as we did As

Table 2 Comparison with alternative models

ModelsOverall accuracy (N)

Threat accuracy(Sensitivity)

Non-threataccuracy (Specificity)

Rare event I 724 328 823

12 Ratio (368508) (66201) (302367)

Rare event II 788 175 99

13 Ratio (656832) (36206) (620626)

Out of sampling 541 145 821

KJI KJU (6471195) (72495) (575700)

NK leadership KJI 685 (98143) 459 (2861) 793 (6582)

KJI vs KJU KJU 595 (195328) 612 (93152) 58 (102176)

SK leadership Con 614 (221360) 363 (58160) 815 (163200)

Con vs Prog Prog 681 (98144) 493 (3571) 863 (6373)

No cheonan 669 514 787

Case (273408) 92178 181230

Our Model 823 503 961

(401487) (74147) (327340)

Detecting patterns in North Korean military provocations 21

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Table 2 shows the double platform alternative yields a worse outcomeoverall In fact its only advantage is that the Kim-Jong-un-model hasslightly higher threat accuracy (612) than our original model(503) In every other aspect however the Kim-Jong-un-model per-forms much worse Moreover the Kim-Jong-il-model has a much loweraccuracy in all respects than our original model As a result it is clearthat the double platform is a worse alternative to our single platformThe recent leadership change in Pyongyang has shown little effects asfar as its military provocations are concerned

Fourth it is important to investigate whether a leadership change inSouth Korea has any impact on North Korean military provocationsHas the oscillation of power between the lsquoprogressiversquo administrations(Kim Dae-jung and Roh Moo-hyun) and the lsquoconservativersquo regimes(Lee Myung-bak and Park Geun-hye) invoked different responses fromPyongyang If so we should expect a double platform (one model forthe conservative period vs the other model for the progressive era inSouth Korea) to outperform the original single platform As Table 2shows however the two separate models in the double-platform alter-native perform much worse than the original single platform in all cat-egories an overall accuracy threat accuracy and non-threat accuracyClearly the original model is a better choice It is shown above thatthe leadership change in Pyongyang has not produced significant policyshifts in its military provocations Likewise leadership changes inSouth Korea do not seem to have invoked major policy shifts inPyongyang as far as its military provocations are concerned

Finally it is worth examining an alternative model that excludes thesinking of the Cheonan naval ship case Unlike the remaining four at-tacks in our dataset the North Korean government has consistentlydenied its involvement in the Cheonan incident If Pyongyang had notreally sunk the Cheonan as it claimed it means that our original modelis performing below its full potential because it is built upon some er-roneous cases (ie the Cheonan incident) In that case we should ex-pect the no-Cheonan alternative to outperform our original model AsTable 2 shows however when we exclude all the data related to theCheonan case the resulting model loses much explanatory power com-pared to the original model Only in one category (a threat accuracy)the no-Cheonan-case alternative outperforms our original model buteven in that category the difference is ignorable (503 vs 514) In

22 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

all other categories the original model shows much stronger perfor-mances 823 vs 669 in an overall accuracy and 961 vs 787 ina non-threat accuracy Although Pyongyang has denied its involvementin the Cheonan incident the huge performance gap between the origi-nal model and the no-Cheonan alternative suggests otherwise

A different way of comparing performances of alternative models isshown in Figure 6 where the Receiver Operating Characteristic (ROC)is presented A ROC space is defined by a False Positive Rate (1 ndashSpecificity) in the x-axis and a True Positive Rate (Sensitivity) in the y-axis In a ROC space a theoretically perfect model (ie a combinationof 100 True Positive Rate with 0 False Positive Rate) appears in theupper left corner with the coordinates (0 1) In comparison a purelyrandom model such as a coin toss is found along a 45-degree diagonalline As a result good models (ie those performing better than ran-dom guesses) are found above the 45 degree line (especially close to they-axis) whereas bad models are located below it In Figure 6 the

Figure 6 Receiver operating characteristic (ROC) plot

Detecting patterns in North Korean military provocations 23

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

coordinates of our model (004 05) means that it has a False PositiveRate of 4 while its True Positive Rate is 50 By contrast the coordi-nates of alternative models in the ROC space are as follows (i) theRare Event I model with 12 ratio is (018 033) (ii) the Rare Event IImodel with 13 ratio is (001 018) (iii) the Out of Sampling (KJI KJU) model is (018 015) (iv) the Leadership Change (KJI-only)model is (021 046) (v) the Leadership Change (KJU-only) model is(042 061) (vi) the model for Conservative South Korean leadershipis (018 036) (vii) the model for Conservative South Korean leader-ship is (014 049) and (viii) the model that excludes the Cheonan caseis (021 051) As one can see all eight alternatives yield their coordi-nates either close to or lower than the diagonal line in the ROC spacewhereas our model lies above the diagonal line and close to the y-axisin the ROC space As a result we can conclude that the original modelperforms better than its potential rivals

4 Conclusion

In this article we studied articles published by the KCNA the officialnews outlet of North Korea in order to analyze patterns of its conven-tional military provocations To this end we have adopted a newmethod of automated text classification through supervised machinelearning Our model investigated the frequency of all terms appearingin KCNA articles immediately prior to five North Korean military at-tacks between 1997 and 2013 The frequency of these terms was thencompared with the frequency of terms appearing in KCNA articlespublished during peacetime without military provocations The com-parison brought to light a number of key terms ndash lsquoattack wordsrsquo so tospeak ndash whose appearances spiked in the KCNA prior to NorthKorean attacks Based on these terms our model correctly identifieseight of 10 articles as signs of imminent attacks or as peacetime newspieces

Specifically our model found five pattern-detecting terms of NorthKorean military threats lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquo lsquoJunersquo andlsquoJapanesersquo For a proper analysis of their meaning we went back to thearticles in which these five attack words appeared in order to investi-gate the contexts in which they were used by the North Korean gov-ernment Our investigation shows that in the lead-up to an attack

24 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Pyongyang displayed a strong tendency to emphasize the legacy of itsmilitary struggle against lsquoJapanesersquo colonialism and its fight against theUS imperialism during the Korean War (which began in lsquoJunersquo 1950)perhaps in an effort to heighten a domestic patriotic fervor at a timeof impending crisis In addition immediately before Pyongyang em-barks on hostile provocations the KCNA tends to increase its reprint-ing of lsquosignedrsquo commentaries from Rodong Sinmun which typically crit-icizes the United States South Korea and Japan in harsh termsprobably reflecting its deteriorating relations with the outside world

It seems that our methodology is promising to studies of interna-tional conflicts of various sorts First as shown in this article amachine-learning technology is useful in terms of detecting patternsthat distinguish threats of Pyongyang from its lsquonoisesrsquo or lsquobluffingrsquoSecond for other authoritarian regimes where there is a tight govern-ment controls over the mass media our approach can be used to detectpatterns symptoms or clues to escalating crisis Finally even for dem-ocratic countries with a free media our approach can be used on vari-ous occasions For instance a machine-learning technology can be uti-lized to analyze certain aspects of domestic politics such as signalingor communication between policymakers and the public To cite a con-crete example we can use a machine-learning technique to examinehow an aggressive foreign policy like lsquosabre rattlingrsquo affects public per-ceptions of fear or crisis in democracy Also a machine-learning ap-proach can be used to analyze external interactions of a democratic re-gime with other countries For instance we can test if there are anysignificant correlations between presidential elections in the US and thelevel of North Korean threats or between North Korean nuclear testsand varying public reactions in South Korea

As a final note it is not our contention that the model developed inthis article based on an automated machine-learning technique has de-tected intentional signals which Pyongyang sends to the outside worldimmediately prior to its military attack Instead the key attack wordswe have identified should be seen as signs or patterns that the NorthKorean regime unwittingly displays when it is inching toward a militaryoption For two reasons we doubt an intentional or signaling natureof the attack words in our model First if Pyongyang has indeed usedthe five attack words as a deliberate signal to the outside world that itis gearing up for military provocations why would it send the signal in

Detecting patterns in North Korean military provocations 25

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

such an arcane way that can be detected only through a complicatedautomated text classification process If the North Korean regime in-tends to send a signal to the outside world there is a better clearerand more visible option such as an Official Announcement by theMinistry of Foreign Affairs which Pyongyang has occasionally pub-lished in the pages of the KCNA on important issues Second if it hadbeen sending intentional signals of an impending military strike whywould the North Korean regime deny such an attack afterwards If re-peated an ex post denial would only reduce the credibility of an exante signal creating an image of North Korea as a casual bluffer Fortheir arcane nature and occasional ex post denial the five attack wordsin our model should not be treated as a premeditated signal fromPyongyang Instead they should be understood as inadvertent signs orpatterns unconsciously displayed by the North Korean governmentprior to a military attack In fact it is the unplanned nature of theseattack words that provides even more valuable information to the out-side world because it eliminates the possibility of feigned or false sig-nals from North Korea Like a pitcher who unknowingly flinches be-fore he throws a fast ball Pyongyang may unwittingly display certainpatterns before it launches a military strike

Acknowledgements

The publication of this article is supported by the National ResearchFoundation of Korea Grant funded by the Korean Government (NRF-2014S1A5A2A03065042 and NRF-2013S1A3A2055081) Previous ver-sions of this article were presented at the 2014 International StudiesAssociation Annual Convention (Toronto Canada) For questions re-garding the article please contact H Joo

References

Breslow NE (1196) lsquoStatistics in epidemiology the case-control studyrsquoJournal of American Statistical Association 91 433 14ndash28

Cha V (2010) The Sinking of the Cheonan Center for Strategic ampInternational Studies (2010 April 22) httpcsisorgpublicationsinking-cheonan (24 July 2014 date last accessed)

26 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Choe S-H (2009) Korean Navies Skirmish in DisputedWaters New YorkTimes (2009 November 10) httpwwwnytimescom20091111worldasia11koreahtml_rfrac140 (23 June 2014 date last accessed)

Ertekin S (2013) lsquoAdaptive oversampling for imbalanced data classificationrsquoin G Erol amp L Ricard (ed) Information Sciences and Systems pp261ndash269 New York Springer

Global Security (2002) The naval clash on the yellow sea on 29 June 2002 be-tween South and North Korea the situation and ROKrsquos position (2002July 1) Global Security httpwwwglobalsecurityorgwmdlibrarynewsdprk2002dprk-020701-1htm (22 July 2014 date last accessed)

Hong S (2012) lsquoNorth Korearsquos capability to conduct provocations and ROK-US capability to counter themrsquo Korea Association of Defense IndustryStudies 19 2 135ndash136

Hopkins D and King G (2010) lsquoExtracting systemic social science meaningfrom textrsquo American Journal of Political Science 54 1 229ndash47

Jung S-C (2013) Internal Instability and External Provocation Seoul KINU

Jung S-Y (2008) lsquoA study on the origin of the pueblo incident on 1968rsquoKukjejongchiyonrsquogu 11 2 179ndash207

Jurafsky D and Martin J (2009) Speech and Natural Language ProcessingAn Introduction to Natural Language Processing Computational Linguisticsand Speech Recognition NJ Prentice Hall

Kang C-G (2013) lsquoA laboratory for recursive partyrsquo Hangukcontentshakhoe11 4 23ndash32

King G and Zeng L (2001a) lsquoLogistic regression in rare events datarsquoPolitical Analysis 9 2 137ndash163

King G and Zeng L (2001b) lsquoExplaining rare events in InternationalRelationsrsquo International Organization 55 3 693ndash715

Ko M-K (2015) lsquoNorth Korean military adventurism in the late 1960s andthe changes of party-military relationsrsquo Hyondaibukhanyongu 18 3 7ndash58

Lee W-K (2014) lsquoA case study on the provocations by NKrsquo Kunsaji 91 663ndash110

Litwak R (2007) Regime Change Washington DC Johns HopkinsUniversity Press

Macfie N (2013) The Battles of the Korean West Sea (2010 November 29)Reutershttpwwwreuterscomarticle20101129us-korea-north-clashes-idUSTRE6AS1AL20101129 (5 August 2014 date last accessed)

Mearsheimer J J (2001) The Tragedy of Great Power Politics New YorkWW Norton amp Company

Moore M and Hutchison P (2010) Yeonpyeong Island A History(2010November 23) The Telegraph httpwwwtelegraphcouknewsworldnews

Detecting patterns in North Korean military provocations 27

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

asiasouthkorea8155486Yeonpyeong-Island-A-historyhtml (7 August 2014date last accessed)

Oh I-W (2011) lsquoSouth Korearsquos countermeasures against North KorearsquosArmed Provocation and Offensive dialogue proposalrsquo Trsquoongiljyollyak 111 227ndash266

Rich T (2012a) lsquoLike father like son Correlates of leaderhip in NorthKorearsquos english language newsrsquo Korea Observer 43 4 649ndash674

Rich T (2012b) lsquoDeciphering North Korearsquos nuclear rhetoric an automatedcontent analysis of KCNA newsrsquo Asian Affairs An American Review 3973ndash89

Sigal L (1998) Disarming Strangers Nuclear Diplomacy with North KoreaPrinceton Princeton University Press

Sohn J-Y (2002) South North Korea clash at sea CNN (2002 June 29)httpeditioncnncom2002WORLDasiapcfeast0629koreawarships (21July 2014 date last accessed)

Sudworth J (2010) How South Korean Ship was Sunk BBC News (2010May 20) httpwwwbbccouknews10130909 (4 August 2014 date lastaccessed)

28 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

  • lcw016-FN1
  • lcw016-FN2
  • lcw016-FN3
  • lcw016-FN4
Page 10: Detecting patterns in North Korean military provocations ...idse.or.kr/file/Whang_1.pdf · that North Korea may have because of increasing power asymmetry since the collapse of the

the articles published within a week preceding each attack is labelled aslsquothreatrsquo and the rest of KCNA articles were treated as lsquonon-threatrsquo Intotal there were 1624 KCNA articles published during this periodwith 487 labelled as lsquothreatrsquo and 1137 labelled as lsquonon-threatrsquoFor the threat articles we extracted all the articles published within aweek of each North Korean military provocation (Fig 2) For the non-threat articles we selected tranches of news articles for a randomlychosen 10-day period from at least two months before or after a NorthKorean attack In so doing our goal was to capture articles that wereunrelated to a North Korean attack thus labelled as lsquonon-threatrsquo

Second the selected dataset is then lsquopreprocessedrsquo Since the originalarticles are not ready for an automated text analysis a cleaning processcalled lsquopreprocessingrsquo is necessary to prepare them for machine-learning Pre-processing is the standard procedure before the applica-tion of an automated text analysis Following the standard preprocess-ing procedure we remove all numbers and stopwords (eg a the into etc)3 We then transformed the body of the text into a documentterm matrix that was composed of rows of news articles followed bycolumns of terms The document term matrix turned preprocessed texts

Non-Threat

-2 +2

Acl

All Threacollected eading up

All Nowere ptwo mo

Threat

Provocation

at-labeledfrom the p to a pro

on-Threatpulled twoonths afte

d articles wseven day

ovocation

t-labeled o months er a provo

ere ays

articles before or

ocation

Non-

r

Threat

Time

Figure 2 Labelling threat and non-threat articles

3 In this article we use one of the most popular preprocessing methods known as the lsquobag ofwordsrsquo concept (Jurafsky and Martin 2009) It discards the use of word order as a factorand removes the so-called lsquostopwordsrsquo from the data The stopwords include punctuationcapitalization common words (eg at an the etc) and numbers For a full list of thestopwords used in our research please visit httpjmlrcsailmitedupapersvolume5lewis04aa11-smart-stop-listenglishstop Although preprocessing reduces the amount of infor-mation it has been shown by researchers that a simplification of text via preprocessing issufficient to infer valuable models (Hopkins and King 2010)

10 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

into quantifiable values (eg term counts and frequencies) for each cor-responding article

Finally once we had collected and preprocessed the KCNA data wesplit the whole data into two subsets a training dataset and a testdataset From the complete set of articles from both threat and non-threat tranches (1624 articles in total) we randomly selected 70(1137 articles) for the training dataset (Fig 3) The labels lsquothreatrsquo orlsquonon-threatrsquo for each article were included in the training dataset forthe purpose of automated machine-learning that is to develop amodel that can select pattern-detecting features based on a priori clas-sifications Basically the key pattern was frequency of occurrence Thefrequency with which certain words and short phrases appeared inthreat or non-threat articles determined how they were selected andweighted as pattern-detecting features in the model The trainedmodel was then applied to the remaining test dataset which was usedsolely for testing the accuracy of our model Importantly we removedall the threat and non-threat labels from each article in the testdataset The purpose for doing so was to challenge our model to usewhat it had learned (from the training dataset) about significant fea-tures (ie a term frequency rate) to accurately classify KCNA articles(in the test dataset) as either a threat or a non-threat without relyingon labels4

All Data 1624 Articles

1137 Articles 487

Training DataThreat Labels Provided

Test DataThreat Labels Removed

Figure 3 Training and test data sets

4 For replication our data is available at httpsdataverseharvardedudatasetxhtmlpersistentIdfrac14doi3A1079102FDVN2FB8CWWD

Detecting patterns in North Korean military provocations 11

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

2 Results

21 Training data results

Figure 4 shows the model we obtained from the training dataset usinga supervised machine-learning The key terms distinguishing NorthKorean threats from a peaceful period were lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquolsquoJunersquo and lsquoJapanesersquo When certain conditions were satisfied (as ex-plained below) the appearance of these terms in KCNA articles couldplay the role of a pattern-indicator that a North Korean military threatwas on the horizon

According to our model the first and strongest indicator of a NorthKorean military provocation is the appearance of the term lsquoyearsrsquo Inthe training dataset all KCNA articles in which the word lsquoyearsrsquo ap-peared at least once (53 articles in total) were published within a weekof a North Korean military strike From our viewpoint the most reli-able sign of an imminent North Korean military provocation is a sud-den increase in the frequency of the term lsquoyearsrsquo in KCNA articles

When the term lsquoyearsrsquo is absent the second best indicator of a NorthKorean military attack is the word lsquosignedrsquo Approximately 85 ofKCNA articles (51 articles in total) that had the term lsquosignedrsquo withoutthe word lsquobakrsquo (from former South Korean president Lee Myung-bak)

Figure 4 First conditional inference tree

12 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

were published within a week of a North Korea military provocationAs a result a sudden spike in the use of the word lsquosignedrsquo (withoutlsquobakrsquo) indicates that a North Korean military provocation is around thecorner In this respect however there is a further condition that mustbe satisfied When lsquosignedrsquo and lsquobakrsquo appear together in the same arti-cle they indicate a pattern of a peaceful situation instead As a resultthe pattern-detecting role of the term lsquosignedrsquo appears to be contingentupon the presence or absence of another term lsquobakrsquo

According to our model the third index indicating that Pyongyangmay be preparing for a military operation is the term lsquoassemblyrsquo Infact all the KCNA articles that included the word lsquoassemblyrsquo withoutlsquoyearsrsquo or lsquosignedrsquo (22 articles in the training dataset) were publishedwithin one week of a North Korean military attack As a result a sud-den increase in the frequency of the word lsquoassemblyrsquo may be a goodpattern indicator that a North Korean military strike is on the horizoneven in the absence of stronger threat terms such as lsquoyearsrsquo andlsquosignedrsquo

The fourth indicator of a North Korean military threat is the termlsquoJunersquo Unlike other indicators however the role of lsquoJunersquo is condi-tional In the absence of other stronger signs (ie years signed and as-sembly) the term lsquoJunersquo can play the role of a significant indicator of aNorth Korean military attack only if another term lsquoreunificationrsquo ap-pears once or less in the same article To our surprise however theterm lsquoJunersquo turns into a strong indicator of a non-threat situation if theword lsquoreunificationrsquo appears more than twice in the same article As aresult it seems that the word lsquoJunersquo implies an increasing NorthKorean threat only if it is not closely associated with another key pat-tern indicator lsquoreunificationrsquo

The last index of a North Korean military provocation is the wordlsquoJapanesersquo When other (and stronger) terms such as lsquoyearsrsquo lsquosignedrsquolsquoassemblyrsquo and lsquoJunersquo were not present all the KCNA articles that in-cluded the term lsquoJapanesersquo (17 articles in the training dataset) appearedwithin one week of a North Korean military provocation As a resultthe use of the term lsquoJapanesersquo serves as a good pattern indication thatin the absence of other attack words Pyongyang is moving dangerouslyclose to a military strike

Detecting patterns in North Korean military provocations 13

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

22 Testing the model

There are 904 KCNA articles that do not include any of the indicatorsdiscussed above Because they do not contain any pattern indicator ofa North Korean attack our model identifies them as non-threats Inreality however about 20 of these articles were published within oneweek of the five military provocations by Pyongyang As a result ourmodel has roughly 80 accuracy in identifying North Korean militarythreats What is encouraging is that our model has identified five keypattern indicators of increasing North Korean threats When these keyterms are put together under certain conditions they correctly identify80 of articles in the training dataset as threat articles or non-threatitems Put differently the model can accurately classify in 8 of 10 caseswhether various messages from Pyongyang are real threats or justrhetoric

Although 80 accuracy is impressive it is too early to be optimisticAfter all 80 accuracy was achieved with the training dataset fromwhich our model was originally derived If we were impressed by its80 accuracy rate we would resemble a case study specialist who de-veloped an elaborate theory from a few cases and then applied it backto those original cases only to be impressed by how accurate hishertheory was The real test consists of applying our model to cases otherthan those from which it is derived This is the reason why we dividedthe entire KCNA data into two subsets the training dataset fromwhich our model is developed and the test dataset to which it will beapplied as a real test

Table 1 shows the result of our model against the test datasetNumbers in the cells indicate whether there was agreement or discrep-ancy between model predictions and actual cases For example our

Table 1 Model accuracy against test dataset

ActualModel Threat Non-threat

Threat 74 13 Positive predictive value 085

Non-threat 73 327 Negative predictive value 082

Sensitivity Specificity Overall accuracy

050 096 082

14 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

model classified 327 cases (the lower right cell) as non-threat and thosearticles were in fact published during non-threat weeks In additionthe model classified 74 articles (the upper left cell) as threats and thosearticles were actually published during threat weeks When we applyour model to the test dataset its overall accuracy turns out 823roughly the same level we obtained with the training dataset Althoughour model is deduced from the training dataset it does not lose cate-gorizing capacity when applied to the test dataset Specifically of 487articles in the test dataset our model correctly classified 401 articles(823) as either threats or non-threats The model is very effective inidentifying harmless rhetoric from Pyongyang that is messages not re-sulting in actual military attacks When applied to a total of 340 non-threat articles in the test dataset our model successfully classified 327items as noise with only 13 misses (ie 967 accuracy) As a result itis very effective at filtering noise from Pyongyang By contrast thepattern-detecting accuracy of our model drops somewhat with respectto threats Of 147 threat articles in the test dataset it correctly identi-fies 74 items as threats while misrecognizing 73 threats as peaceful arti-cles (ie 50 accuracy)

3 Discussion

What does the model tell us in plain terms In particular what are themeanings of the key indicators for a North Korean military threat (ieyears signed assembly June and Japanese) To answer it is necessaryto go back to the original KCNA documents in order to understandthe contexts in which these terms were used A close reading of theKCNA articles that included the five key indicators or lsquoattack wordsrsquoreveals several interesting patterns

First the North Korean regime tends to emphasize its history ofmilitary struggle against foreign enemies before it launches armed prov-ocations It is then understandable that key terms such as lsquoyearsrsquo andlsquoJapanesersquo often appear in KCNA articles immediately before a militarystrike Prior to the second naval clash of Yonpyong on 29 June 2002for instance the North Korean government published a series of arti-cles in which it boasted of the lsquoyearsrsquo of North Korearsquos victoriousstruggles against foreign enemies including lsquoJapanesersquo colonialism(1910ndash45) It is possible that Pyongyang invoked the military legacy of

Detecting patterns in North Korean military provocations 15

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

its supreme leader Kim Il-sung in anticipation of an impending con-flict either to boost the morale of its people or to show the outsideworld that it was determined not to back down

Second the most perplexing term in Figure 4 is lsquoJunersquo because itcan be indicative of both threat and peace Specifically the word lsquoJunersquois an indicator of peace when associated with lsquoreunificationrsquo but it canalso be a sign of a military threat when it does not appear in conjunc-tion with lsquoreunificationrsquo A careful reading of the KCNA articles thatinclude the term lsquoJunersquo suggests an explanation for this strange phe-nomenon It turns out that the word lsquoJunersquo can refer to either theKorean War (which broke out on 25 June 1950) or the historic summitbetween Kim Dae-jung and Kim Il-sung on 15 June 2000 (which waspraised as a major cornerstone for future lsquoreunificationrsquo in the officialNorth Korean press) As a result when the word lsquoJunersquo appears inclose association with lsquoreunificationrsquo it refers to the 2000 summit (orlsquothe June 15 Summitrsquo as it is called in North Korea) thus indicatingthe peaceful intentions of Pyongyang By contrast when the wordlsquoJunersquo appears alone without connection to lsquoreunificationrsquo it refers tothe Korean War thus conveying a more hostile mood

Third it is also interesting to note that the North Korean govern-ment tends to publish many lsquosignedrsquo commentaries from RodongSinmun in the KCNA immediately before it launches military provoca-tions In these lsquosignedrsquo commentaries of Rodong Sinmun Pyongyangusually criticizes foreign countries (especially the United States Japanand South Korea) in very harsh terms on various issues Accordinglythe third term lsquosignedrsquo seems to correspond to the fact that RodongSinmun the official mouthpiece of the North Korean regime barksvery loudly before it actually bites its intended target A sudden in-crease in the number of lsquosignedrsquo commentaries of Rodong Sinmun inKCNA articles appears to be a reliable sign that the North Koreangovernment may be turning to an attack mode

Finally the last indicator of an imminent threat lsquoassemblyrsquo is theeasiest term to identify but the hardest one to interpret As shown inFigure 5 the term lsquoassemblyrsquo is primarily used in reference to the SPABecause the SPA is the highest North Korean authority with regard toits relations with foreign countries it is tempting to interpret a suddenincrease in references to the SPA prior to military provocations as indi-cating that Pyongyang is attempting to openly signal its hostile

16 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

intentions to the outside world In other words belligerent messagesemanating from the SPA (and reported in KCNA articles) may revealthe determination of the North Korean regime

Although plausible the problem with such an interpretation is thata close reading of several KCNA articles using the term lsquoassemblyrsquoshows that messages from the SPA are often not dire at all For exam-ple consider the following article published on 17 November 2010 justdays before the North Korean shelling at Yonpyong Island lsquoKim YongNam President of the Presidium of the DPRK SPA sent a message ofgreetings to Qaboos Bin Said Sultan of Oman on Wednesday on theoccasion of its national holidayrsquo (KCNA 17 November 2010) As thisexample illustrates a typical KCNA article with the word lsquoassemblyrsquodescribes rather routine business of the SPA such as how it sent a mes-sage of congratulations to a foreign country how it greeted visitingdignitaries from foreign countries and so on Clearly such mundanemessages cannot be a meaningful sign of an imminent North Koreanmilitary provocation While the publication of these articles (containingthe term lsquoassemblyrsquo) might possibly correspond to an uptick inPyongyangrsquos diplomatic efforts to strengthen its ties with foreign coun-tries and increase international support prior to a military attack weneed further analyses to substantiate such a hypothesis At the sametime it also seems clear from the test that the term lsquoassemblyrsquo does not

Figure 5 Social network analysis for key words in KCNA articles

Detecting patterns in North Korean military provocations 17

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

appear randomly and is somehow linked to the timing of NorthKorean military provocations Further research is necessary to make aproper interpretation of this seemingly irrelevant yet apparently signifi-cant indicator of North Korean military threats

31 Robustness checks

Before we conclude it is necessary to address five issues regarding therobustness of our findings (i) the rare-event issue (that is due to therare nature of a North Korean military attack our sampling ratio of710 should be justified) (ii) an out-of-sample alternative (that is in-stead of building our model by using 70 of data from the five NorthKorean military strikes and then applying it to the remaining 30 ofthe data it may be better to develop a model from the first threestrikes and then apply it to the remaining two North Korean provoca-tions) and (iii) a North Korean leadership change effect (that is in-stead of elaborating a single model covering 1997 to 2013 it may makesense to develop two separate models [one for the Kim Jong-il periodand the other for the Kim Jong-un era] in order to check whether thereis a leadership change effect) (iv) a South Korean leadership changeeffect (that is it may be better to elaborate two different models [onefor the lsquoconservativersquo period during Lee Myung-bak and Park Geun-hye and the other for the lsquoprogressiversquo era during Kim Dae-jung andRoh Moo-hyun] in order to investigate whether North Korea re-sponded differently to leadership changes in South Korea and (v) ano-Cheonan alternative (that is it is worth elaborating a model whileexcluding the Cheonan naval ship case because unlike the remainingfour attacks Pyongyang has denied its involvement in sinking theCheonan In this section these five issues are addressed to check therobustness of our model

First there is a discrepancy in the ratio of threat days vs non-threatdays between the data used in our analysis and the actual frequencyAs explained earlier we have defined the threat articles as those pub-lished one week prior to each actual crisis whereas the non-threatitems are defined as those chosen randomly for a 10-day period at leasttwo months before or after actual provocations For all five crises theratio of our data is then 3550 (in days) or 710 In reality however thenumber of peace days (ie 18 years minus 35 days) is much larger than

18 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

the number of crisis days (35 days) When we count all of the peacedays that are not included in our data (ie when we take all TrueNegatives into our data) the North Korean provocations become ex-tremely rare events As a result a possible criticism is that our ap-proach does not represent the true data generating process in a sam-pling procedure While military provocations occur rarely in realityour sampling procedure exaggerates their likelihood by significantly re-ducing the number of peace days in the data Put more technically ourmodel reduces the number of False Positives while increasing the num-ber of False Negatives

Despite the criticism our sampling strategy can be justified for tworeasons First the rare-event problem is not new in political scienceFor example King and Zeng (2001ab) addressed the same problem instatistical analysis (eg logit analysis with rare events) A commonlysuggested solution is the lsquochoice-based or endogenous stratified sam-plingrsquo in econometrics and the lsquocase-control designrsquo in epidemiology(Breslow 1996) The idea is to lsquoselect within categories of Yrsquo such thatall crisis days are sampled while we use a small random fraction ofnon-crisis days In this respect our sampling strategy is consistent withsuch an approach in that we have chosen all crisis days (35) and a ran-dom fraction of peace days (50) Not surprisingly such an approach iscommonly used in supervised machine-learning as well The rare-eventissue also known as a class imbalance problem is an ongoing issue insupervised machine-learning because the size of one class is often muchlarger than that of the other class (eg premature births violent civilconflicts fraudulent credit card transactions etc) In supervisedmachine-learning two solutions have been suggested a data samplingtechnique and an algorithmic modification technique Whereas datasampling is to under-sample the majority class while over-sampling theminority class a modification technique is to combine both over- andunder-sampling methods at the same time In this article we haveadopted a data sampling technique in order to address the class imbal-ance between threat days (a minority class) and non-threat days (a ma-jority class) in North Korean military provocations

Second although reducing the number of non-crisis days in ourdata (under-sampling) while using all the crisis days (over-sampling) isconsistent with both statistical and machine-learning literature therestill remains a question How far should we reduce the number of

Detecting patterns in North Korean military provocations 19

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

peace days Will a ratio of 12 13 or an even more skewed ratio gen-erate a better performance than our model using a 710 ratio To an-swer this question we pushed the ratio until we saw a clear sign of themodel becoming too skewed towards the majority class (peace days)In our case it occurred when the ratio between the two classes ex-ceeded 13 On this subject literature on machine-learning clearlyshows that the ratio of crisis and non-crisis should not be significantlyunbalanced According to Ertekin (2013) lsquo[i]n problems where theprevalence of classes is imbalanced it is necessary to prevent the resul-tant model from being skewed towards the majority (negative) classand to ensure that the model is capable of reflecting the true nature ofthe minority (positive) classrsquo Otherwise the generated model can sufferfrom an over-fitting problem For instance in the face of an extremelyskewed dataset (eg the real ratio of our data would be 35 crisis daysvs 6535 peace days) an automated machine-learning process naturallybecomes lsquogreedyrsquo that is in order to increase its overall accuracy it in-creasingly focuses on the larger category (6535 days) while virtually ig-noring the smaller category (35 days) Generating the model in thisway would be problematic however because our object is to focus onthe minority category of crisis days which is less than 05 of all daysfrom 1997 to 2013 After all we are trying to classify upcoming NorthKorean attacks not peaceful days

With this goal in mind we ran robustness checks to see if alternativemodels yield a better explanatory power when we increased the numberof peace days vis-a-vis threat days Because the maximum limit of anunbalanced ratio for automated machine-learning is 13 we attemptedboth 12 and 13 in our tests As Table 2 shows it is clear that the orig-inal model outperforms both alternatives Compared to the 12 ratiothe original model has more explanatory power in every way (iehigher overall accuracy higher threat accuracy higher non-threat accu-racy) By contrast the 13 ratio model has a slightly lower overall accu-racy marginally higher non-threat accuracy but devastatingly lowerthreat accuracy (175) than our original model because the increasingimbalance (from 710 to 13) turned the automated machine-learningprocess to a lsquogreedyrsquo mode As a result it is our conclusion that theoriginal model uses the golden ratio with the most accurate categoriz-ing capacity

20 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Second it is also necessary to check whether an out-of-sample ap-proach is a better alternative to our original model In our analysis wedeveloped a model by using 70 of data from the five North Koreanattacks and then applied it to the remaining 30 of data from Onemay suspect however that a better approach is to elaborate a modelfrom the first three North Korean strikes (ie during the Kim Jong-ilperiod) and then apply it to the remaining two attacks (ie during theKim Jong-un era) in order to see whether it yields more explanatorypower As Table 2 shows the opposite turns out to be the case In factthe out-of-sampling model loses explanatory power in every possibleway (ie lower overall accuracy lower threat accuracy and lower non-threat accuracy) As a result our original model outperforms the out-of-sampling alternative

Third it is necessary to check whether or not there is a leadershipchange effect in North Korea Has the change of power from KimJong-il to Kim Jong-un produced such significant differences in theirleadership style that it may be worth developing two separate models(one for the Kim Jong-il period the other for the Kim Jong-un era)instead of a single model covering the entire period as we did As

Table 2 Comparison with alternative models

ModelsOverall accuracy (N)

Threat accuracy(Sensitivity)

Non-threataccuracy (Specificity)

Rare event I 724 328 823

12 Ratio (368508) (66201) (302367)

Rare event II 788 175 99

13 Ratio (656832) (36206) (620626)

Out of sampling 541 145 821

KJI KJU (6471195) (72495) (575700)

NK leadership KJI 685 (98143) 459 (2861) 793 (6582)

KJI vs KJU KJU 595 (195328) 612 (93152) 58 (102176)

SK leadership Con 614 (221360) 363 (58160) 815 (163200)

Con vs Prog Prog 681 (98144) 493 (3571) 863 (6373)

No cheonan 669 514 787

Case (273408) 92178 181230

Our Model 823 503 961

(401487) (74147) (327340)

Detecting patterns in North Korean military provocations 21

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Table 2 shows the double platform alternative yields a worse outcomeoverall In fact its only advantage is that the Kim-Jong-un-model hasslightly higher threat accuracy (612) than our original model(503) In every other aspect however the Kim-Jong-un-model per-forms much worse Moreover the Kim-Jong-il-model has a much loweraccuracy in all respects than our original model As a result it is clearthat the double platform is a worse alternative to our single platformThe recent leadership change in Pyongyang has shown little effects asfar as its military provocations are concerned

Fourth it is important to investigate whether a leadership change inSouth Korea has any impact on North Korean military provocationsHas the oscillation of power between the lsquoprogressiversquo administrations(Kim Dae-jung and Roh Moo-hyun) and the lsquoconservativersquo regimes(Lee Myung-bak and Park Geun-hye) invoked different responses fromPyongyang If so we should expect a double platform (one model forthe conservative period vs the other model for the progressive era inSouth Korea) to outperform the original single platform As Table 2shows however the two separate models in the double-platform alter-native perform much worse than the original single platform in all cat-egories an overall accuracy threat accuracy and non-threat accuracyClearly the original model is a better choice It is shown above thatthe leadership change in Pyongyang has not produced significant policyshifts in its military provocations Likewise leadership changes inSouth Korea do not seem to have invoked major policy shifts inPyongyang as far as its military provocations are concerned

Finally it is worth examining an alternative model that excludes thesinking of the Cheonan naval ship case Unlike the remaining four at-tacks in our dataset the North Korean government has consistentlydenied its involvement in the Cheonan incident If Pyongyang had notreally sunk the Cheonan as it claimed it means that our original modelis performing below its full potential because it is built upon some er-roneous cases (ie the Cheonan incident) In that case we should ex-pect the no-Cheonan alternative to outperform our original model AsTable 2 shows however when we exclude all the data related to theCheonan case the resulting model loses much explanatory power com-pared to the original model Only in one category (a threat accuracy)the no-Cheonan-case alternative outperforms our original model buteven in that category the difference is ignorable (503 vs 514) In

22 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

all other categories the original model shows much stronger perfor-mances 823 vs 669 in an overall accuracy and 961 vs 787 ina non-threat accuracy Although Pyongyang has denied its involvementin the Cheonan incident the huge performance gap between the origi-nal model and the no-Cheonan alternative suggests otherwise

A different way of comparing performances of alternative models isshown in Figure 6 where the Receiver Operating Characteristic (ROC)is presented A ROC space is defined by a False Positive Rate (1 ndashSpecificity) in the x-axis and a True Positive Rate (Sensitivity) in the y-axis In a ROC space a theoretically perfect model (ie a combinationof 100 True Positive Rate with 0 False Positive Rate) appears in theupper left corner with the coordinates (0 1) In comparison a purelyrandom model such as a coin toss is found along a 45-degree diagonalline As a result good models (ie those performing better than ran-dom guesses) are found above the 45 degree line (especially close to they-axis) whereas bad models are located below it In Figure 6 the

Figure 6 Receiver operating characteristic (ROC) plot

Detecting patterns in North Korean military provocations 23

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

coordinates of our model (004 05) means that it has a False PositiveRate of 4 while its True Positive Rate is 50 By contrast the coordi-nates of alternative models in the ROC space are as follows (i) theRare Event I model with 12 ratio is (018 033) (ii) the Rare Event IImodel with 13 ratio is (001 018) (iii) the Out of Sampling (KJI KJU) model is (018 015) (iv) the Leadership Change (KJI-only)model is (021 046) (v) the Leadership Change (KJU-only) model is(042 061) (vi) the model for Conservative South Korean leadershipis (018 036) (vii) the model for Conservative South Korean leader-ship is (014 049) and (viii) the model that excludes the Cheonan caseis (021 051) As one can see all eight alternatives yield their coordi-nates either close to or lower than the diagonal line in the ROC spacewhereas our model lies above the diagonal line and close to the y-axisin the ROC space As a result we can conclude that the original modelperforms better than its potential rivals

4 Conclusion

In this article we studied articles published by the KCNA the officialnews outlet of North Korea in order to analyze patterns of its conven-tional military provocations To this end we have adopted a newmethod of automated text classification through supervised machinelearning Our model investigated the frequency of all terms appearingin KCNA articles immediately prior to five North Korean military at-tacks between 1997 and 2013 The frequency of these terms was thencompared with the frequency of terms appearing in KCNA articlespublished during peacetime without military provocations The com-parison brought to light a number of key terms ndash lsquoattack wordsrsquo so tospeak ndash whose appearances spiked in the KCNA prior to NorthKorean attacks Based on these terms our model correctly identifieseight of 10 articles as signs of imminent attacks or as peacetime newspieces

Specifically our model found five pattern-detecting terms of NorthKorean military threats lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquo lsquoJunersquo andlsquoJapanesersquo For a proper analysis of their meaning we went back to thearticles in which these five attack words appeared in order to investi-gate the contexts in which they were used by the North Korean gov-ernment Our investigation shows that in the lead-up to an attack

24 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Pyongyang displayed a strong tendency to emphasize the legacy of itsmilitary struggle against lsquoJapanesersquo colonialism and its fight against theUS imperialism during the Korean War (which began in lsquoJunersquo 1950)perhaps in an effort to heighten a domestic patriotic fervor at a timeof impending crisis In addition immediately before Pyongyang em-barks on hostile provocations the KCNA tends to increase its reprint-ing of lsquosignedrsquo commentaries from Rodong Sinmun which typically crit-icizes the United States South Korea and Japan in harsh termsprobably reflecting its deteriorating relations with the outside world

It seems that our methodology is promising to studies of interna-tional conflicts of various sorts First as shown in this article amachine-learning technology is useful in terms of detecting patternsthat distinguish threats of Pyongyang from its lsquonoisesrsquo or lsquobluffingrsquoSecond for other authoritarian regimes where there is a tight govern-ment controls over the mass media our approach can be used to detectpatterns symptoms or clues to escalating crisis Finally even for dem-ocratic countries with a free media our approach can be used on vari-ous occasions For instance a machine-learning technology can be uti-lized to analyze certain aspects of domestic politics such as signalingor communication between policymakers and the public To cite a con-crete example we can use a machine-learning technique to examinehow an aggressive foreign policy like lsquosabre rattlingrsquo affects public per-ceptions of fear or crisis in democracy Also a machine-learning ap-proach can be used to analyze external interactions of a democratic re-gime with other countries For instance we can test if there are anysignificant correlations between presidential elections in the US and thelevel of North Korean threats or between North Korean nuclear testsand varying public reactions in South Korea

As a final note it is not our contention that the model developed inthis article based on an automated machine-learning technique has de-tected intentional signals which Pyongyang sends to the outside worldimmediately prior to its military attack Instead the key attack wordswe have identified should be seen as signs or patterns that the NorthKorean regime unwittingly displays when it is inching toward a militaryoption For two reasons we doubt an intentional or signaling natureof the attack words in our model First if Pyongyang has indeed usedthe five attack words as a deliberate signal to the outside world that itis gearing up for military provocations why would it send the signal in

Detecting patterns in North Korean military provocations 25

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

such an arcane way that can be detected only through a complicatedautomated text classification process If the North Korean regime in-tends to send a signal to the outside world there is a better clearerand more visible option such as an Official Announcement by theMinistry of Foreign Affairs which Pyongyang has occasionally pub-lished in the pages of the KCNA on important issues Second if it hadbeen sending intentional signals of an impending military strike whywould the North Korean regime deny such an attack afterwards If re-peated an ex post denial would only reduce the credibility of an exante signal creating an image of North Korea as a casual bluffer Fortheir arcane nature and occasional ex post denial the five attack wordsin our model should not be treated as a premeditated signal fromPyongyang Instead they should be understood as inadvertent signs orpatterns unconsciously displayed by the North Korean governmentprior to a military attack In fact it is the unplanned nature of theseattack words that provides even more valuable information to the out-side world because it eliminates the possibility of feigned or false sig-nals from North Korea Like a pitcher who unknowingly flinches be-fore he throws a fast ball Pyongyang may unwittingly display certainpatterns before it launches a military strike

Acknowledgements

The publication of this article is supported by the National ResearchFoundation of Korea Grant funded by the Korean Government (NRF-2014S1A5A2A03065042 and NRF-2013S1A3A2055081) Previous ver-sions of this article were presented at the 2014 International StudiesAssociation Annual Convention (Toronto Canada) For questions re-garding the article please contact H Joo

References

Breslow NE (1196) lsquoStatistics in epidemiology the case-control studyrsquoJournal of American Statistical Association 91 433 14ndash28

Cha V (2010) The Sinking of the Cheonan Center for Strategic ampInternational Studies (2010 April 22) httpcsisorgpublicationsinking-cheonan (24 July 2014 date last accessed)

26 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Choe S-H (2009) Korean Navies Skirmish in DisputedWaters New YorkTimes (2009 November 10) httpwwwnytimescom20091111worldasia11koreahtml_rfrac140 (23 June 2014 date last accessed)

Ertekin S (2013) lsquoAdaptive oversampling for imbalanced data classificationrsquoin G Erol amp L Ricard (ed) Information Sciences and Systems pp261ndash269 New York Springer

Global Security (2002) The naval clash on the yellow sea on 29 June 2002 be-tween South and North Korea the situation and ROKrsquos position (2002July 1) Global Security httpwwwglobalsecurityorgwmdlibrarynewsdprk2002dprk-020701-1htm (22 July 2014 date last accessed)

Hong S (2012) lsquoNorth Korearsquos capability to conduct provocations and ROK-US capability to counter themrsquo Korea Association of Defense IndustryStudies 19 2 135ndash136

Hopkins D and King G (2010) lsquoExtracting systemic social science meaningfrom textrsquo American Journal of Political Science 54 1 229ndash47

Jung S-C (2013) Internal Instability and External Provocation Seoul KINU

Jung S-Y (2008) lsquoA study on the origin of the pueblo incident on 1968rsquoKukjejongchiyonrsquogu 11 2 179ndash207

Jurafsky D and Martin J (2009) Speech and Natural Language ProcessingAn Introduction to Natural Language Processing Computational Linguisticsand Speech Recognition NJ Prentice Hall

Kang C-G (2013) lsquoA laboratory for recursive partyrsquo Hangukcontentshakhoe11 4 23ndash32

King G and Zeng L (2001a) lsquoLogistic regression in rare events datarsquoPolitical Analysis 9 2 137ndash163

King G and Zeng L (2001b) lsquoExplaining rare events in InternationalRelationsrsquo International Organization 55 3 693ndash715

Ko M-K (2015) lsquoNorth Korean military adventurism in the late 1960s andthe changes of party-military relationsrsquo Hyondaibukhanyongu 18 3 7ndash58

Lee W-K (2014) lsquoA case study on the provocations by NKrsquo Kunsaji 91 663ndash110

Litwak R (2007) Regime Change Washington DC Johns HopkinsUniversity Press

Macfie N (2013) The Battles of the Korean West Sea (2010 November 29)Reutershttpwwwreuterscomarticle20101129us-korea-north-clashes-idUSTRE6AS1AL20101129 (5 August 2014 date last accessed)

Mearsheimer J J (2001) The Tragedy of Great Power Politics New YorkWW Norton amp Company

Moore M and Hutchison P (2010) Yeonpyeong Island A History(2010November 23) The Telegraph httpwwwtelegraphcouknewsworldnews

Detecting patterns in North Korean military provocations 27

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

asiasouthkorea8155486Yeonpyeong-Island-A-historyhtml (7 August 2014date last accessed)

Oh I-W (2011) lsquoSouth Korearsquos countermeasures against North KorearsquosArmed Provocation and Offensive dialogue proposalrsquo Trsquoongiljyollyak 111 227ndash266

Rich T (2012a) lsquoLike father like son Correlates of leaderhip in NorthKorearsquos english language newsrsquo Korea Observer 43 4 649ndash674

Rich T (2012b) lsquoDeciphering North Korearsquos nuclear rhetoric an automatedcontent analysis of KCNA newsrsquo Asian Affairs An American Review 3973ndash89

Sigal L (1998) Disarming Strangers Nuclear Diplomacy with North KoreaPrinceton Princeton University Press

Sohn J-Y (2002) South North Korea clash at sea CNN (2002 June 29)httpeditioncnncom2002WORLDasiapcfeast0629koreawarships (21July 2014 date last accessed)

Sudworth J (2010) How South Korean Ship was Sunk BBC News (2010May 20) httpwwwbbccouknews10130909 (4 August 2014 date lastaccessed)

28 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

  • lcw016-FN1
  • lcw016-FN2
  • lcw016-FN3
  • lcw016-FN4
Page 11: Detecting patterns in North Korean military provocations ...idse.or.kr/file/Whang_1.pdf · that North Korea may have because of increasing power asymmetry since the collapse of the

into quantifiable values (eg term counts and frequencies) for each cor-responding article

Finally once we had collected and preprocessed the KCNA data wesplit the whole data into two subsets a training dataset and a testdataset From the complete set of articles from both threat and non-threat tranches (1624 articles in total) we randomly selected 70(1137 articles) for the training dataset (Fig 3) The labels lsquothreatrsquo orlsquonon-threatrsquo for each article were included in the training dataset forthe purpose of automated machine-learning that is to develop amodel that can select pattern-detecting features based on a priori clas-sifications Basically the key pattern was frequency of occurrence Thefrequency with which certain words and short phrases appeared inthreat or non-threat articles determined how they were selected andweighted as pattern-detecting features in the model The trainedmodel was then applied to the remaining test dataset which was usedsolely for testing the accuracy of our model Importantly we removedall the threat and non-threat labels from each article in the testdataset The purpose for doing so was to challenge our model to usewhat it had learned (from the training dataset) about significant fea-tures (ie a term frequency rate) to accurately classify KCNA articles(in the test dataset) as either a threat or a non-threat without relyingon labels4

All Data 1624 Articles

1137 Articles 487

Training DataThreat Labels Provided

Test DataThreat Labels Removed

Figure 3 Training and test data sets

4 For replication our data is available at httpsdataverseharvardedudatasetxhtmlpersistentIdfrac14doi3A1079102FDVN2FB8CWWD

Detecting patterns in North Korean military provocations 11

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

2 Results

21 Training data results

Figure 4 shows the model we obtained from the training dataset usinga supervised machine-learning The key terms distinguishing NorthKorean threats from a peaceful period were lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquolsquoJunersquo and lsquoJapanesersquo When certain conditions were satisfied (as ex-plained below) the appearance of these terms in KCNA articles couldplay the role of a pattern-indicator that a North Korean military threatwas on the horizon

According to our model the first and strongest indicator of a NorthKorean military provocation is the appearance of the term lsquoyearsrsquo Inthe training dataset all KCNA articles in which the word lsquoyearsrsquo ap-peared at least once (53 articles in total) were published within a weekof a North Korean military strike From our viewpoint the most reli-able sign of an imminent North Korean military provocation is a sud-den increase in the frequency of the term lsquoyearsrsquo in KCNA articles

When the term lsquoyearsrsquo is absent the second best indicator of a NorthKorean military attack is the word lsquosignedrsquo Approximately 85 ofKCNA articles (51 articles in total) that had the term lsquosignedrsquo withoutthe word lsquobakrsquo (from former South Korean president Lee Myung-bak)

Figure 4 First conditional inference tree

12 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

were published within a week of a North Korea military provocationAs a result a sudden spike in the use of the word lsquosignedrsquo (withoutlsquobakrsquo) indicates that a North Korean military provocation is around thecorner In this respect however there is a further condition that mustbe satisfied When lsquosignedrsquo and lsquobakrsquo appear together in the same arti-cle they indicate a pattern of a peaceful situation instead As a resultthe pattern-detecting role of the term lsquosignedrsquo appears to be contingentupon the presence or absence of another term lsquobakrsquo

According to our model the third index indicating that Pyongyangmay be preparing for a military operation is the term lsquoassemblyrsquo Infact all the KCNA articles that included the word lsquoassemblyrsquo withoutlsquoyearsrsquo or lsquosignedrsquo (22 articles in the training dataset) were publishedwithin one week of a North Korean military attack As a result a sud-den increase in the frequency of the word lsquoassemblyrsquo may be a goodpattern indicator that a North Korean military strike is on the horizoneven in the absence of stronger threat terms such as lsquoyearsrsquo andlsquosignedrsquo

The fourth indicator of a North Korean military threat is the termlsquoJunersquo Unlike other indicators however the role of lsquoJunersquo is condi-tional In the absence of other stronger signs (ie years signed and as-sembly) the term lsquoJunersquo can play the role of a significant indicator of aNorth Korean military attack only if another term lsquoreunificationrsquo ap-pears once or less in the same article To our surprise however theterm lsquoJunersquo turns into a strong indicator of a non-threat situation if theword lsquoreunificationrsquo appears more than twice in the same article As aresult it seems that the word lsquoJunersquo implies an increasing NorthKorean threat only if it is not closely associated with another key pat-tern indicator lsquoreunificationrsquo

The last index of a North Korean military provocation is the wordlsquoJapanesersquo When other (and stronger) terms such as lsquoyearsrsquo lsquosignedrsquolsquoassemblyrsquo and lsquoJunersquo were not present all the KCNA articles that in-cluded the term lsquoJapanesersquo (17 articles in the training dataset) appearedwithin one week of a North Korean military provocation As a resultthe use of the term lsquoJapanesersquo serves as a good pattern indication thatin the absence of other attack words Pyongyang is moving dangerouslyclose to a military strike

Detecting patterns in North Korean military provocations 13

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

22 Testing the model

There are 904 KCNA articles that do not include any of the indicatorsdiscussed above Because they do not contain any pattern indicator ofa North Korean attack our model identifies them as non-threats Inreality however about 20 of these articles were published within oneweek of the five military provocations by Pyongyang As a result ourmodel has roughly 80 accuracy in identifying North Korean militarythreats What is encouraging is that our model has identified five keypattern indicators of increasing North Korean threats When these keyterms are put together under certain conditions they correctly identify80 of articles in the training dataset as threat articles or non-threatitems Put differently the model can accurately classify in 8 of 10 caseswhether various messages from Pyongyang are real threats or justrhetoric

Although 80 accuracy is impressive it is too early to be optimisticAfter all 80 accuracy was achieved with the training dataset fromwhich our model was originally derived If we were impressed by its80 accuracy rate we would resemble a case study specialist who de-veloped an elaborate theory from a few cases and then applied it backto those original cases only to be impressed by how accurate hishertheory was The real test consists of applying our model to cases otherthan those from which it is derived This is the reason why we dividedthe entire KCNA data into two subsets the training dataset fromwhich our model is developed and the test dataset to which it will beapplied as a real test

Table 1 shows the result of our model against the test datasetNumbers in the cells indicate whether there was agreement or discrep-ancy between model predictions and actual cases For example our

Table 1 Model accuracy against test dataset

ActualModel Threat Non-threat

Threat 74 13 Positive predictive value 085

Non-threat 73 327 Negative predictive value 082

Sensitivity Specificity Overall accuracy

050 096 082

14 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

model classified 327 cases (the lower right cell) as non-threat and thosearticles were in fact published during non-threat weeks In additionthe model classified 74 articles (the upper left cell) as threats and thosearticles were actually published during threat weeks When we applyour model to the test dataset its overall accuracy turns out 823roughly the same level we obtained with the training dataset Althoughour model is deduced from the training dataset it does not lose cate-gorizing capacity when applied to the test dataset Specifically of 487articles in the test dataset our model correctly classified 401 articles(823) as either threats or non-threats The model is very effective inidentifying harmless rhetoric from Pyongyang that is messages not re-sulting in actual military attacks When applied to a total of 340 non-threat articles in the test dataset our model successfully classified 327items as noise with only 13 misses (ie 967 accuracy) As a result itis very effective at filtering noise from Pyongyang By contrast thepattern-detecting accuracy of our model drops somewhat with respectto threats Of 147 threat articles in the test dataset it correctly identi-fies 74 items as threats while misrecognizing 73 threats as peaceful arti-cles (ie 50 accuracy)

3 Discussion

What does the model tell us in plain terms In particular what are themeanings of the key indicators for a North Korean military threat (ieyears signed assembly June and Japanese) To answer it is necessaryto go back to the original KCNA documents in order to understandthe contexts in which these terms were used A close reading of theKCNA articles that included the five key indicators or lsquoattack wordsrsquoreveals several interesting patterns

First the North Korean regime tends to emphasize its history ofmilitary struggle against foreign enemies before it launches armed prov-ocations It is then understandable that key terms such as lsquoyearsrsquo andlsquoJapanesersquo often appear in KCNA articles immediately before a militarystrike Prior to the second naval clash of Yonpyong on 29 June 2002for instance the North Korean government published a series of arti-cles in which it boasted of the lsquoyearsrsquo of North Korearsquos victoriousstruggles against foreign enemies including lsquoJapanesersquo colonialism(1910ndash45) It is possible that Pyongyang invoked the military legacy of

Detecting patterns in North Korean military provocations 15

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

its supreme leader Kim Il-sung in anticipation of an impending con-flict either to boost the morale of its people or to show the outsideworld that it was determined not to back down

Second the most perplexing term in Figure 4 is lsquoJunersquo because itcan be indicative of both threat and peace Specifically the word lsquoJunersquois an indicator of peace when associated with lsquoreunificationrsquo but it canalso be a sign of a military threat when it does not appear in conjunc-tion with lsquoreunificationrsquo A careful reading of the KCNA articles thatinclude the term lsquoJunersquo suggests an explanation for this strange phe-nomenon It turns out that the word lsquoJunersquo can refer to either theKorean War (which broke out on 25 June 1950) or the historic summitbetween Kim Dae-jung and Kim Il-sung on 15 June 2000 (which waspraised as a major cornerstone for future lsquoreunificationrsquo in the officialNorth Korean press) As a result when the word lsquoJunersquo appears inclose association with lsquoreunificationrsquo it refers to the 2000 summit (orlsquothe June 15 Summitrsquo as it is called in North Korea) thus indicatingthe peaceful intentions of Pyongyang By contrast when the wordlsquoJunersquo appears alone without connection to lsquoreunificationrsquo it refers tothe Korean War thus conveying a more hostile mood

Third it is also interesting to note that the North Korean govern-ment tends to publish many lsquosignedrsquo commentaries from RodongSinmun in the KCNA immediately before it launches military provoca-tions In these lsquosignedrsquo commentaries of Rodong Sinmun Pyongyangusually criticizes foreign countries (especially the United States Japanand South Korea) in very harsh terms on various issues Accordinglythe third term lsquosignedrsquo seems to correspond to the fact that RodongSinmun the official mouthpiece of the North Korean regime barksvery loudly before it actually bites its intended target A sudden in-crease in the number of lsquosignedrsquo commentaries of Rodong Sinmun inKCNA articles appears to be a reliable sign that the North Koreangovernment may be turning to an attack mode

Finally the last indicator of an imminent threat lsquoassemblyrsquo is theeasiest term to identify but the hardest one to interpret As shown inFigure 5 the term lsquoassemblyrsquo is primarily used in reference to the SPABecause the SPA is the highest North Korean authority with regard toits relations with foreign countries it is tempting to interpret a suddenincrease in references to the SPA prior to military provocations as indi-cating that Pyongyang is attempting to openly signal its hostile

16 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

intentions to the outside world In other words belligerent messagesemanating from the SPA (and reported in KCNA articles) may revealthe determination of the North Korean regime

Although plausible the problem with such an interpretation is thata close reading of several KCNA articles using the term lsquoassemblyrsquoshows that messages from the SPA are often not dire at all For exam-ple consider the following article published on 17 November 2010 justdays before the North Korean shelling at Yonpyong Island lsquoKim YongNam President of the Presidium of the DPRK SPA sent a message ofgreetings to Qaboos Bin Said Sultan of Oman on Wednesday on theoccasion of its national holidayrsquo (KCNA 17 November 2010) As thisexample illustrates a typical KCNA article with the word lsquoassemblyrsquodescribes rather routine business of the SPA such as how it sent a mes-sage of congratulations to a foreign country how it greeted visitingdignitaries from foreign countries and so on Clearly such mundanemessages cannot be a meaningful sign of an imminent North Koreanmilitary provocation While the publication of these articles (containingthe term lsquoassemblyrsquo) might possibly correspond to an uptick inPyongyangrsquos diplomatic efforts to strengthen its ties with foreign coun-tries and increase international support prior to a military attack weneed further analyses to substantiate such a hypothesis At the sametime it also seems clear from the test that the term lsquoassemblyrsquo does not

Figure 5 Social network analysis for key words in KCNA articles

Detecting patterns in North Korean military provocations 17

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

appear randomly and is somehow linked to the timing of NorthKorean military provocations Further research is necessary to make aproper interpretation of this seemingly irrelevant yet apparently signifi-cant indicator of North Korean military threats

31 Robustness checks

Before we conclude it is necessary to address five issues regarding therobustness of our findings (i) the rare-event issue (that is due to therare nature of a North Korean military attack our sampling ratio of710 should be justified) (ii) an out-of-sample alternative (that is in-stead of building our model by using 70 of data from the five NorthKorean military strikes and then applying it to the remaining 30 ofthe data it may be better to develop a model from the first threestrikes and then apply it to the remaining two North Korean provoca-tions) and (iii) a North Korean leadership change effect (that is in-stead of elaborating a single model covering 1997 to 2013 it may makesense to develop two separate models [one for the Kim Jong-il periodand the other for the Kim Jong-un era] in order to check whether thereis a leadership change effect) (iv) a South Korean leadership changeeffect (that is it may be better to elaborate two different models [onefor the lsquoconservativersquo period during Lee Myung-bak and Park Geun-hye and the other for the lsquoprogressiversquo era during Kim Dae-jung andRoh Moo-hyun] in order to investigate whether North Korea re-sponded differently to leadership changes in South Korea and (v) ano-Cheonan alternative (that is it is worth elaborating a model whileexcluding the Cheonan naval ship case because unlike the remainingfour attacks Pyongyang has denied its involvement in sinking theCheonan In this section these five issues are addressed to check therobustness of our model

First there is a discrepancy in the ratio of threat days vs non-threatdays between the data used in our analysis and the actual frequencyAs explained earlier we have defined the threat articles as those pub-lished one week prior to each actual crisis whereas the non-threatitems are defined as those chosen randomly for a 10-day period at leasttwo months before or after actual provocations For all five crises theratio of our data is then 3550 (in days) or 710 In reality however thenumber of peace days (ie 18 years minus 35 days) is much larger than

18 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

the number of crisis days (35 days) When we count all of the peacedays that are not included in our data (ie when we take all TrueNegatives into our data) the North Korean provocations become ex-tremely rare events As a result a possible criticism is that our ap-proach does not represent the true data generating process in a sam-pling procedure While military provocations occur rarely in realityour sampling procedure exaggerates their likelihood by significantly re-ducing the number of peace days in the data Put more technically ourmodel reduces the number of False Positives while increasing the num-ber of False Negatives

Despite the criticism our sampling strategy can be justified for tworeasons First the rare-event problem is not new in political scienceFor example King and Zeng (2001ab) addressed the same problem instatistical analysis (eg logit analysis with rare events) A commonlysuggested solution is the lsquochoice-based or endogenous stratified sam-plingrsquo in econometrics and the lsquocase-control designrsquo in epidemiology(Breslow 1996) The idea is to lsquoselect within categories of Yrsquo such thatall crisis days are sampled while we use a small random fraction ofnon-crisis days In this respect our sampling strategy is consistent withsuch an approach in that we have chosen all crisis days (35) and a ran-dom fraction of peace days (50) Not surprisingly such an approach iscommonly used in supervised machine-learning as well The rare-eventissue also known as a class imbalance problem is an ongoing issue insupervised machine-learning because the size of one class is often muchlarger than that of the other class (eg premature births violent civilconflicts fraudulent credit card transactions etc) In supervisedmachine-learning two solutions have been suggested a data samplingtechnique and an algorithmic modification technique Whereas datasampling is to under-sample the majority class while over-sampling theminority class a modification technique is to combine both over- andunder-sampling methods at the same time In this article we haveadopted a data sampling technique in order to address the class imbal-ance between threat days (a minority class) and non-threat days (a ma-jority class) in North Korean military provocations

Second although reducing the number of non-crisis days in ourdata (under-sampling) while using all the crisis days (over-sampling) isconsistent with both statistical and machine-learning literature therestill remains a question How far should we reduce the number of

Detecting patterns in North Korean military provocations 19

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

peace days Will a ratio of 12 13 or an even more skewed ratio gen-erate a better performance than our model using a 710 ratio To an-swer this question we pushed the ratio until we saw a clear sign of themodel becoming too skewed towards the majority class (peace days)In our case it occurred when the ratio between the two classes ex-ceeded 13 On this subject literature on machine-learning clearlyshows that the ratio of crisis and non-crisis should not be significantlyunbalanced According to Ertekin (2013) lsquo[i]n problems where theprevalence of classes is imbalanced it is necessary to prevent the resul-tant model from being skewed towards the majority (negative) classand to ensure that the model is capable of reflecting the true nature ofthe minority (positive) classrsquo Otherwise the generated model can sufferfrom an over-fitting problem For instance in the face of an extremelyskewed dataset (eg the real ratio of our data would be 35 crisis daysvs 6535 peace days) an automated machine-learning process naturallybecomes lsquogreedyrsquo that is in order to increase its overall accuracy it in-creasingly focuses on the larger category (6535 days) while virtually ig-noring the smaller category (35 days) Generating the model in thisway would be problematic however because our object is to focus onthe minority category of crisis days which is less than 05 of all daysfrom 1997 to 2013 After all we are trying to classify upcoming NorthKorean attacks not peaceful days

With this goal in mind we ran robustness checks to see if alternativemodels yield a better explanatory power when we increased the numberof peace days vis-a-vis threat days Because the maximum limit of anunbalanced ratio for automated machine-learning is 13 we attemptedboth 12 and 13 in our tests As Table 2 shows it is clear that the orig-inal model outperforms both alternatives Compared to the 12 ratiothe original model has more explanatory power in every way (iehigher overall accuracy higher threat accuracy higher non-threat accu-racy) By contrast the 13 ratio model has a slightly lower overall accu-racy marginally higher non-threat accuracy but devastatingly lowerthreat accuracy (175) than our original model because the increasingimbalance (from 710 to 13) turned the automated machine-learningprocess to a lsquogreedyrsquo mode As a result it is our conclusion that theoriginal model uses the golden ratio with the most accurate categoriz-ing capacity

20 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Second it is also necessary to check whether an out-of-sample ap-proach is a better alternative to our original model In our analysis wedeveloped a model by using 70 of data from the five North Koreanattacks and then applied it to the remaining 30 of data from Onemay suspect however that a better approach is to elaborate a modelfrom the first three North Korean strikes (ie during the Kim Jong-ilperiod) and then apply it to the remaining two attacks (ie during theKim Jong-un era) in order to see whether it yields more explanatorypower As Table 2 shows the opposite turns out to be the case In factthe out-of-sampling model loses explanatory power in every possibleway (ie lower overall accuracy lower threat accuracy and lower non-threat accuracy) As a result our original model outperforms the out-of-sampling alternative

Third it is necessary to check whether or not there is a leadershipchange effect in North Korea Has the change of power from KimJong-il to Kim Jong-un produced such significant differences in theirleadership style that it may be worth developing two separate models(one for the Kim Jong-il period the other for the Kim Jong-un era)instead of a single model covering the entire period as we did As

Table 2 Comparison with alternative models

ModelsOverall accuracy (N)

Threat accuracy(Sensitivity)

Non-threataccuracy (Specificity)

Rare event I 724 328 823

12 Ratio (368508) (66201) (302367)

Rare event II 788 175 99

13 Ratio (656832) (36206) (620626)

Out of sampling 541 145 821

KJI KJU (6471195) (72495) (575700)

NK leadership KJI 685 (98143) 459 (2861) 793 (6582)

KJI vs KJU KJU 595 (195328) 612 (93152) 58 (102176)

SK leadership Con 614 (221360) 363 (58160) 815 (163200)

Con vs Prog Prog 681 (98144) 493 (3571) 863 (6373)

No cheonan 669 514 787

Case (273408) 92178 181230

Our Model 823 503 961

(401487) (74147) (327340)

Detecting patterns in North Korean military provocations 21

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Table 2 shows the double platform alternative yields a worse outcomeoverall In fact its only advantage is that the Kim-Jong-un-model hasslightly higher threat accuracy (612) than our original model(503) In every other aspect however the Kim-Jong-un-model per-forms much worse Moreover the Kim-Jong-il-model has a much loweraccuracy in all respects than our original model As a result it is clearthat the double platform is a worse alternative to our single platformThe recent leadership change in Pyongyang has shown little effects asfar as its military provocations are concerned

Fourth it is important to investigate whether a leadership change inSouth Korea has any impact on North Korean military provocationsHas the oscillation of power between the lsquoprogressiversquo administrations(Kim Dae-jung and Roh Moo-hyun) and the lsquoconservativersquo regimes(Lee Myung-bak and Park Geun-hye) invoked different responses fromPyongyang If so we should expect a double platform (one model forthe conservative period vs the other model for the progressive era inSouth Korea) to outperform the original single platform As Table 2shows however the two separate models in the double-platform alter-native perform much worse than the original single platform in all cat-egories an overall accuracy threat accuracy and non-threat accuracyClearly the original model is a better choice It is shown above thatthe leadership change in Pyongyang has not produced significant policyshifts in its military provocations Likewise leadership changes inSouth Korea do not seem to have invoked major policy shifts inPyongyang as far as its military provocations are concerned

Finally it is worth examining an alternative model that excludes thesinking of the Cheonan naval ship case Unlike the remaining four at-tacks in our dataset the North Korean government has consistentlydenied its involvement in the Cheonan incident If Pyongyang had notreally sunk the Cheonan as it claimed it means that our original modelis performing below its full potential because it is built upon some er-roneous cases (ie the Cheonan incident) In that case we should ex-pect the no-Cheonan alternative to outperform our original model AsTable 2 shows however when we exclude all the data related to theCheonan case the resulting model loses much explanatory power com-pared to the original model Only in one category (a threat accuracy)the no-Cheonan-case alternative outperforms our original model buteven in that category the difference is ignorable (503 vs 514) In

22 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

all other categories the original model shows much stronger perfor-mances 823 vs 669 in an overall accuracy and 961 vs 787 ina non-threat accuracy Although Pyongyang has denied its involvementin the Cheonan incident the huge performance gap between the origi-nal model and the no-Cheonan alternative suggests otherwise

A different way of comparing performances of alternative models isshown in Figure 6 where the Receiver Operating Characteristic (ROC)is presented A ROC space is defined by a False Positive Rate (1 ndashSpecificity) in the x-axis and a True Positive Rate (Sensitivity) in the y-axis In a ROC space a theoretically perfect model (ie a combinationof 100 True Positive Rate with 0 False Positive Rate) appears in theupper left corner with the coordinates (0 1) In comparison a purelyrandom model such as a coin toss is found along a 45-degree diagonalline As a result good models (ie those performing better than ran-dom guesses) are found above the 45 degree line (especially close to they-axis) whereas bad models are located below it In Figure 6 the

Figure 6 Receiver operating characteristic (ROC) plot

Detecting patterns in North Korean military provocations 23

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

coordinates of our model (004 05) means that it has a False PositiveRate of 4 while its True Positive Rate is 50 By contrast the coordi-nates of alternative models in the ROC space are as follows (i) theRare Event I model with 12 ratio is (018 033) (ii) the Rare Event IImodel with 13 ratio is (001 018) (iii) the Out of Sampling (KJI KJU) model is (018 015) (iv) the Leadership Change (KJI-only)model is (021 046) (v) the Leadership Change (KJU-only) model is(042 061) (vi) the model for Conservative South Korean leadershipis (018 036) (vii) the model for Conservative South Korean leader-ship is (014 049) and (viii) the model that excludes the Cheonan caseis (021 051) As one can see all eight alternatives yield their coordi-nates either close to or lower than the diagonal line in the ROC spacewhereas our model lies above the diagonal line and close to the y-axisin the ROC space As a result we can conclude that the original modelperforms better than its potential rivals

4 Conclusion

In this article we studied articles published by the KCNA the officialnews outlet of North Korea in order to analyze patterns of its conven-tional military provocations To this end we have adopted a newmethod of automated text classification through supervised machinelearning Our model investigated the frequency of all terms appearingin KCNA articles immediately prior to five North Korean military at-tacks between 1997 and 2013 The frequency of these terms was thencompared with the frequency of terms appearing in KCNA articlespublished during peacetime without military provocations The com-parison brought to light a number of key terms ndash lsquoattack wordsrsquo so tospeak ndash whose appearances spiked in the KCNA prior to NorthKorean attacks Based on these terms our model correctly identifieseight of 10 articles as signs of imminent attacks or as peacetime newspieces

Specifically our model found five pattern-detecting terms of NorthKorean military threats lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquo lsquoJunersquo andlsquoJapanesersquo For a proper analysis of their meaning we went back to thearticles in which these five attack words appeared in order to investi-gate the contexts in which they were used by the North Korean gov-ernment Our investigation shows that in the lead-up to an attack

24 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Pyongyang displayed a strong tendency to emphasize the legacy of itsmilitary struggle against lsquoJapanesersquo colonialism and its fight against theUS imperialism during the Korean War (which began in lsquoJunersquo 1950)perhaps in an effort to heighten a domestic patriotic fervor at a timeof impending crisis In addition immediately before Pyongyang em-barks on hostile provocations the KCNA tends to increase its reprint-ing of lsquosignedrsquo commentaries from Rodong Sinmun which typically crit-icizes the United States South Korea and Japan in harsh termsprobably reflecting its deteriorating relations with the outside world

It seems that our methodology is promising to studies of interna-tional conflicts of various sorts First as shown in this article amachine-learning technology is useful in terms of detecting patternsthat distinguish threats of Pyongyang from its lsquonoisesrsquo or lsquobluffingrsquoSecond for other authoritarian regimes where there is a tight govern-ment controls over the mass media our approach can be used to detectpatterns symptoms or clues to escalating crisis Finally even for dem-ocratic countries with a free media our approach can be used on vari-ous occasions For instance a machine-learning technology can be uti-lized to analyze certain aspects of domestic politics such as signalingor communication between policymakers and the public To cite a con-crete example we can use a machine-learning technique to examinehow an aggressive foreign policy like lsquosabre rattlingrsquo affects public per-ceptions of fear or crisis in democracy Also a machine-learning ap-proach can be used to analyze external interactions of a democratic re-gime with other countries For instance we can test if there are anysignificant correlations between presidential elections in the US and thelevel of North Korean threats or between North Korean nuclear testsand varying public reactions in South Korea

As a final note it is not our contention that the model developed inthis article based on an automated machine-learning technique has de-tected intentional signals which Pyongyang sends to the outside worldimmediately prior to its military attack Instead the key attack wordswe have identified should be seen as signs or patterns that the NorthKorean regime unwittingly displays when it is inching toward a militaryoption For two reasons we doubt an intentional or signaling natureof the attack words in our model First if Pyongyang has indeed usedthe five attack words as a deliberate signal to the outside world that itis gearing up for military provocations why would it send the signal in

Detecting patterns in North Korean military provocations 25

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

such an arcane way that can be detected only through a complicatedautomated text classification process If the North Korean regime in-tends to send a signal to the outside world there is a better clearerand more visible option such as an Official Announcement by theMinistry of Foreign Affairs which Pyongyang has occasionally pub-lished in the pages of the KCNA on important issues Second if it hadbeen sending intentional signals of an impending military strike whywould the North Korean regime deny such an attack afterwards If re-peated an ex post denial would only reduce the credibility of an exante signal creating an image of North Korea as a casual bluffer Fortheir arcane nature and occasional ex post denial the five attack wordsin our model should not be treated as a premeditated signal fromPyongyang Instead they should be understood as inadvertent signs orpatterns unconsciously displayed by the North Korean governmentprior to a military attack In fact it is the unplanned nature of theseattack words that provides even more valuable information to the out-side world because it eliminates the possibility of feigned or false sig-nals from North Korea Like a pitcher who unknowingly flinches be-fore he throws a fast ball Pyongyang may unwittingly display certainpatterns before it launches a military strike

Acknowledgements

The publication of this article is supported by the National ResearchFoundation of Korea Grant funded by the Korean Government (NRF-2014S1A5A2A03065042 and NRF-2013S1A3A2055081) Previous ver-sions of this article were presented at the 2014 International StudiesAssociation Annual Convention (Toronto Canada) For questions re-garding the article please contact H Joo

References

Breslow NE (1196) lsquoStatistics in epidemiology the case-control studyrsquoJournal of American Statistical Association 91 433 14ndash28

Cha V (2010) The Sinking of the Cheonan Center for Strategic ampInternational Studies (2010 April 22) httpcsisorgpublicationsinking-cheonan (24 July 2014 date last accessed)

26 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Choe S-H (2009) Korean Navies Skirmish in DisputedWaters New YorkTimes (2009 November 10) httpwwwnytimescom20091111worldasia11koreahtml_rfrac140 (23 June 2014 date last accessed)

Ertekin S (2013) lsquoAdaptive oversampling for imbalanced data classificationrsquoin G Erol amp L Ricard (ed) Information Sciences and Systems pp261ndash269 New York Springer

Global Security (2002) The naval clash on the yellow sea on 29 June 2002 be-tween South and North Korea the situation and ROKrsquos position (2002July 1) Global Security httpwwwglobalsecurityorgwmdlibrarynewsdprk2002dprk-020701-1htm (22 July 2014 date last accessed)

Hong S (2012) lsquoNorth Korearsquos capability to conduct provocations and ROK-US capability to counter themrsquo Korea Association of Defense IndustryStudies 19 2 135ndash136

Hopkins D and King G (2010) lsquoExtracting systemic social science meaningfrom textrsquo American Journal of Political Science 54 1 229ndash47

Jung S-C (2013) Internal Instability and External Provocation Seoul KINU

Jung S-Y (2008) lsquoA study on the origin of the pueblo incident on 1968rsquoKukjejongchiyonrsquogu 11 2 179ndash207

Jurafsky D and Martin J (2009) Speech and Natural Language ProcessingAn Introduction to Natural Language Processing Computational Linguisticsand Speech Recognition NJ Prentice Hall

Kang C-G (2013) lsquoA laboratory for recursive partyrsquo Hangukcontentshakhoe11 4 23ndash32

King G and Zeng L (2001a) lsquoLogistic regression in rare events datarsquoPolitical Analysis 9 2 137ndash163

King G and Zeng L (2001b) lsquoExplaining rare events in InternationalRelationsrsquo International Organization 55 3 693ndash715

Ko M-K (2015) lsquoNorth Korean military adventurism in the late 1960s andthe changes of party-military relationsrsquo Hyondaibukhanyongu 18 3 7ndash58

Lee W-K (2014) lsquoA case study on the provocations by NKrsquo Kunsaji 91 663ndash110

Litwak R (2007) Regime Change Washington DC Johns HopkinsUniversity Press

Macfie N (2013) The Battles of the Korean West Sea (2010 November 29)Reutershttpwwwreuterscomarticle20101129us-korea-north-clashes-idUSTRE6AS1AL20101129 (5 August 2014 date last accessed)

Mearsheimer J J (2001) The Tragedy of Great Power Politics New YorkWW Norton amp Company

Moore M and Hutchison P (2010) Yeonpyeong Island A History(2010November 23) The Telegraph httpwwwtelegraphcouknewsworldnews

Detecting patterns in North Korean military provocations 27

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

asiasouthkorea8155486Yeonpyeong-Island-A-historyhtml (7 August 2014date last accessed)

Oh I-W (2011) lsquoSouth Korearsquos countermeasures against North KorearsquosArmed Provocation and Offensive dialogue proposalrsquo Trsquoongiljyollyak 111 227ndash266

Rich T (2012a) lsquoLike father like son Correlates of leaderhip in NorthKorearsquos english language newsrsquo Korea Observer 43 4 649ndash674

Rich T (2012b) lsquoDeciphering North Korearsquos nuclear rhetoric an automatedcontent analysis of KCNA newsrsquo Asian Affairs An American Review 3973ndash89

Sigal L (1998) Disarming Strangers Nuclear Diplomacy with North KoreaPrinceton Princeton University Press

Sohn J-Y (2002) South North Korea clash at sea CNN (2002 June 29)httpeditioncnncom2002WORLDasiapcfeast0629koreawarships (21July 2014 date last accessed)

Sudworth J (2010) How South Korean Ship was Sunk BBC News (2010May 20) httpwwwbbccouknews10130909 (4 August 2014 date lastaccessed)

28 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

  • lcw016-FN1
  • lcw016-FN2
  • lcw016-FN3
  • lcw016-FN4
Page 12: Detecting patterns in North Korean military provocations ...idse.or.kr/file/Whang_1.pdf · that North Korea may have because of increasing power asymmetry since the collapse of the

2 Results

21 Training data results

Figure 4 shows the model we obtained from the training dataset usinga supervised machine-learning The key terms distinguishing NorthKorean threats from a peaceful period were lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquolsquoJunersquo and lsquoJapanesersquo When certain conditions were satisfied (as ex-plained below) the appearance of these terms in KCNA articles couldplay the role of a pattern-indicator that a North Korean military threatwas on the horizon

According to our model the first and strongest indicator of a NorthKorean military provocation is the appearance of the term lsquoyearsrsquo Inthe training dataset all KCNA articles in which the word lsquoyearsrsquo ap-peared at least once (53 articles in total) were published within a weekof a North Korean military strike From our viewpoint the most reli-able sign of an imminent North Korean military provocation is a sud-den increase in the frequency of the term lsquoyearsrsquo in KCNA articles

When the term lsquoyearsrsquo is absent the second best indicator of a NorthKorean military attack is the word lsquosignedrsquo Approximately 85 ofKCNA articles (51 articles in total) that had the term lsquosignedrsquo withoutthe word lsquobakrsquo (from former South Korean president Lee Myung-bak)

Figure 4 First conditional inference tree

12 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

were published within a week of a North Korea military provocationAs a result a sudden spike in the use of the word lsquosignedrsquo (withoutlsquobakrsquo) indicates that a North Korean military provocation is around thecorner In this respect however there is a further condition that mustbe satisfied When lsquosignedrsquo and lsquobakrsquo appear together in the same arti-cle they indicate a pattern of a peaceful situation instead As a resultthe pattern-detecting role of the term lsquosignedrsquo appears to be contingentupon the presence or absence of another term lsquobakrsquo

According to our model the third index indicating that Pyongyangmay be preparing for a military operation is the term lsquoassemblyrsquo Infact all the KCNA articles that included the word lsquoassemblyrsquo withoutlsquoyearsrsquo or lsquosignedrsquo (22 articles in the training dataset) were publishedwithin one week of a North Korean military attack As a result a sud-den increase in the frequency of the word lsquoassemblyrsquo may be a goodpattern indicator that a North Korean military strike is on the horizoneven in the absence of stronger threat terms such as lsquoyearsrsquo andlsquosignedrsquo

The fourth indicator of a North Korean military threat is the termlsquoJunersquo Unlike other indicators however the role of lsquoJunersquo is condi-tional In the absence of other stronger signs (ie years signed and as-sembly) the term lsquoJunersquo can play the role of a significant indicator of aNorth Korean military attack only if another term lsquoreunificationrsquo ap-pears once or less in the same article To our surprise however theterm lsquoJunersquo turns into a strong indicator of a non-threat situation if theword lsquoreunificationrsquo appears more than twice in the same article As aresult it seems that the word lsquoJunersquo implies an increasing NorthKorean threat only if it is not closely associated with another key pat-tern indicator lsquoreunificationrsquo

The last index of a North Korean military provocation is the wordlsquoJapanesersquo When other (and stronger) terms such as lsquoyearsrsquo lsquosignedrsquolsquoassemblyrsquo and lsquoJunersquo were not present all the KCNA articles that in-cluded the term lsquoJapanesersquo (17 articles in the training dataset) appearedwithin one week of a North Korean military provocation As a resultthe use of the term lsquoJapanesersquo serves as a good pattern indication thatin the absence of other attack words Pyongyang is moving dangerouslyclose to a military strike

Detecting patterns in North Korean military provocations 13

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

22 Testing the model

There are 904 KCNA articles that do not include any of the indicatorsdiscussed above Because they do not contain any pattern indicator ofa North Korean attack our model identifies them as non-threats Inreality however about 20 of these articles were published within oneweek of the five military provocations by Pyongyang As a result ourmodel has roughly 80 accuracy in identifying North Korean militarythreats What is encouraging is that our model has identified five keypattern indicators of increasing North Korean threats When these keyterms are put together under certain conditions they correctly identify80 of articles in the training dataset as threat articles or non-threatitems Put differently the model can accurately classify in 8 of 10 caseswhether various messages from Pyongyang are real threats or justrhetoric

Although 80 accuracy is impressive it is too early to be optimisticAfter all 80 accuracy was achieved with the training dataset fromwhich our model was originally derived If we were impressed by its80 accuracy rate we would resemble a case study specialist who de-veloped an elaborate theory from a few cases and then applied it backto those original cases only to be impressed by how accurate hishertheory was The real test consists of applying our model to cases otherthan those from which it is derived This is the reason why we dividedthe entire KCNA data into two subsets the training dataset fromwhich our model is developed and the test dataset to which it will beapplied as a real test

Table 1 shows the result of our model against the test datasetNumbers in the cells indicate whether there was agreement or discrep-ancy between model predictions and actual cases For example our

Table 1 Model accuracy against test dataset

ActualModel Threat Non-threat

Threat 74 13 Positive predictive value 085

Non-threat 73 327 Negative predictive value 082

Sensitivity Specificity Overall accuracy

050 096 082

14 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

model classified 327 cases (the lower right cell) as non-threat and thosearticles were in fact published during non-threat weeks In additionthe model classified 74 articles (the upper left cell) as threats and thosearticles were actually published during threat weeks When we applyour model to the test dataset its overall accuracy turns out 823roughly the same level we obtained with the training dataset Althoughour model is deduced from the training dataset it does not lose cate-gorizing capacity when applied to the test dataset Specifically of 487articles in the test dataset our model correctly classified 401 articles(823) as either threats or non-threats The model is very effective inidentifying harmless rhetoric from Pyongyang that is messages not re-sulting in actual military attacks When applied to a total of 340 non-threat articles in the test dataset our model successfully classified 327items as noise with only 13 misses (ie 967 accuracy) As a result itis very effective at filtering noise from Pyongyang By contrast thepattern-detecting accuracy of our model drops somewhat with respectto threats Of 147 threat articles in the test dataset it correctly identi-fies 74 items as threats while misrecognizing 73 threats as peaceful arti-cles (ie 50 accuracy)

3 Discussion

What does the model tell us in plain terms In particular what are themeanings of the key indicators for a North Korean military threat (ieyears signed assembly June and Japanese) To answer it is necessaryto go back to the original KCNA documents in order to understandthe contexts in which these terms were used A close reading of theKCNA articles that included the five key indicators or lsquoattack wordsrsquoreveals several interesting patterns

First the North Korean regime tends to emphasize its history ofmilitary struggle against foreign enemies before it launches armed prov-ocations It is then understandable that key terms such as lsquoyearsrsquo andlsquoJapanesersquo often appear in KCNA articles immediately before a militarystrike Prior to the second naval clash of Yonpyong on 29 June 2002for instance the North Korean government published a series of arti-cles in which it boasted of the lsquoyearsrsquo of North Korearsquos victoriousstruggles against foreign enemies including lsquoJapanesersquo colonialism(1910ndash45) It is possible that Pyongyang invoked the military legacy of

Detecting patterns in North Korean military provocations 15

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

its supreme leader Kim Il-sung in anticipation of an impending con-flict either to boost the morale of its people or to show the outsideworld that it was determined not to back down

Second the most perplexing term in Figure 4 is lsquoJunersquo because itcan be indicative of both threat and peace Specifically the word lsquoJunersquois an indicator of peace when associated with lsquoreunificationrsquo but it canalso be a sign of a military threat when it does not appear in conjunc-tion with lsquoreunificationrsquo A careful reading of the KCNA articles thatinclude the term lsquoJunersquo suggests an explanation for this strange phe-nomenon It turns out that the word lsquoJunersquo can refer to either theKorean War (which broke out on 25 June 1950) or the historic summitbetween Kim Dae-jung and Kim Il-sung on 15 June 2000 (which waspraised as a major cornerstone for future lsquoreunificationrsquo in the officialNorth Korean press) As a result when the word lsquoJunersquo appears inclose association with lsquoreunificationrsquo it refers to the 2000 summit (orlsquothe June 15 Summitrsquo as it is called in North Korea) thus indicatingthe peaceful intentions of Pyongyang By contrast when the wordlsquoJunersquo appears alone without connection to lsquoreunificationrsquo it refers tothe Korean War thus conveying a more hostile mood

Third it is also interesting to note that the North Korean govern-ment tends to publish many lsquosignedrsquo commentaries from RodongSinmun in the KCNA immediately before it launches military provoca-tions In these lsquosignedrsquo commentaries of Rodong Sinmun Pyongyangusually criticizes foreign countries (especially the United States Japanand South Korea) in very harsh terms on various issues Accordinglythe third term lsquosignedrsquo seems to correspond to the fact that RodongSinmun the official mouthpiece of the North Korean regime barksvery loudly before it actually bites its intended target A sudden in-crease in the number of lsquosignedrsquo commentaries of Rodong Sinmun inKCNA articles appears to be a reliable sign that the North Koreangovernment may be turning to an attack mode

Finally the last indicator of an imminent threat lsquoassemblyrsquo is theeasiest term to identify but the hardest one to interpret As shown inFigure 5 the term lsquoassemblyrsquo is primarily used in reference to the SPABecause the SPA is the highest North Korean authority with regard toits relations with foreign countries it is tempting to interpret a suddenincrease in references to the SPA prior to military provocations as indi-cating that Pyongyang is attempting to openly signal its hostile

16 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

intentions to the outside world In other words belligerent messagesemanating from the SPA (and reported in KCNA articles) may revealthe determination of the North Korean regime

Although plausible the problem with such an interpretation is thata close reading of several KCNA articles using the term lsquoassemblyrsquoshows that messages from the SPA are often not dire at all For exam-ple consider the following article published on 17 November 2010 justdays before the North Korean shelling at Yonpyong Island lsquoKim YongNam President of the Presidium of the DPRK SPA sent a message ofgreetings to Qaboos Bin Said Sultan of Oman on Wednesday on theoccasion of its national holidayrsquo (KCNA 17 November 2010) As thisexample illustrates a typical KCNA article with the word lsquoassemblyrsquodescribes rather routine business of the SPA such as how it sent a mes-sage of congratulations to a foreign country how it greeted visitingdignitaries from foreign countries and so on Clearly such mundanemessages cannot be a meaningful sign of an imminent North Koreanmilitary provocation While the publication of these articles (containingthe term lsquoassemblyrsquo) might possibly correspond to an uptick inPyongyangrsquos diplomatic efforts to strengthen its ties with foreign coun-tries and increase international support prior to a military attack weneed further analyses to substantiate such a hypothesis At the sametime it also seems clear from the test that the term lsquoassemblyrsquo does not

Figure 5 Social network analysis for key words in KCNA articles

Detecting patterns in North Korean military provocations 17

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

appear randomly and is somehow linked to the timing of NorthKorean military provocations Further research is necessary to make aproper interpretation of this seemingly irrelevant yet apparently signifi-cant indicator of North Korean military threats

31 Robustness checks

Before we conclude it is necessary to address five issues regarding therobustness of our findings (i) the rare-event issue (that is due to therare nature of a North Korean military attack our sampling ratio of710 should be justified) (ii) an out-of-sample alternative (that is in-stead of building our model by using 70 of data from the five NorthKorean military strikes and then applying it to the remaining 30 ofthe data it may be better to develop a model from the first threestrikes and then apply it to the remaining two North Korean provoca-tions) and (iii) a North Korean leadership change effect (that is in-stead of elaborating a single model covering 1997 to 2013 it may makesense to develop two separate models [one for the Kim Jong-il periodand the other for the Kim Jong-un era] in order to check whether thereis a leadership change effect) (iv) a South Korean leadership changeeffect (that is it may be better to elaborate two different models [onefor the lsquoconservativersquo period during Lee Myung-bak and Park Geun-hye and the other for the lsquoprogressiversquo era during Kim Dae-jung andRoh Moo-hyun] in order to investigate whether North Korea re-sponded differently to leadership changes in South Korea and (v) ano-Cheonan alternative (that is it is worth elaborating a model whileexcluding the Cheonan naval ship case because unlike the remainingfour attacks Pyongyang has denied its involvement in sinking theCheonan In this section these five issues are addressed to check therobustness of our model

First there is a discrepancy in the ratio of threat days vs non-threatdays between the data used in our analysis and the actual frequencyAs explained earlier we have defined the threat articles as those pub-lished one week prior to each actual crisis whereas the non-threatitems are defined as those chosen randomly for a 10-day period at leasttwo months before or after actual provocations For all five crises theratio of our data is then 3550 (in days) or 710 In reality however thenumber of peace days (ie 18 years minus 35 days) is much larger than

18 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

the number of crisis days (35 days) When we count all of the peacedays that are not included in our data (ie when we take all TrueNegatives into our data) the North Korean provocations become ex-tremely rare events As a result a possible criticism is that our ap-proach does not represent the true data generating process in a sam-pling procedure While military provocations occur rarely in realityour sampling procedure exaggerates their likelihood by significantly re-ducing the number of peace days in the data Put more technically ourmodel reduces the number of False Positives while increasing the num-ber of False Negatives

Despite the criticism our sampling strategy can be justified for tworeasons First the rare-event problem is not new in political scienceFor example King and Zeng (2001ab) addressed the same problem instatistical analysis (eg logit analysis with rare events) A commonlysuggested solution is the lsquochoice-based or endogenous stratified sam-plingrsquo in econometrics and the lsquocase-control designrsquo in epidemiology(Breslow 1996) The idea is to lsquoselect within categories of Yrsquo such thatall crisis days are sampled while we use a small random fraction ofnon-crisis days In this respect our sampling strategy is consistent withsuch an approach in that we have chosen all crisis days (35) and a ran-dom fraction of peace days (50) Not surprisingly such an approach iscommonly used in supervised machine-learning as well The rare-eventissue also known as a class imbalance problem is an ongoing issue insupervised machine-learning because the size of one class is often muchlarger than that of the other class (eg premature births violent civilconflicts fraudulent credit card transactions etc) In supervisedmachine-learning two solutions have been suggested a data samplingtechnique and an algorithmic modification technique Whereas datasampling is to under-sample the majority class while over-sampling theminority class a modification technique is to combine both over- andunder-sampling methods at the same time In this article we haveadopted a data sampling technique in order to address the class imbal-ance between threat days (a minority class) and non-threat days (a ma-jority class) in North Korean military provocations

Second although reducing the number of non-crisis days in ourdata (under-sampling) while using all the crisis days (over-sampling) isconsistent with both statistical and machine-learning literature therestill remains a question How far should we reduce the number of

Detecting patterns in North Korean military provocations 19

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

peace days Will a ratio of 12 13 or an even more skewed ratio gen-erate a better performance than our model using a 710 ratio To an-swer this question we pushed the ratio until we saw a clear sign of themodel becoming too skewed towards the majority class (peace days)In our case it occurred when the ratio between the two classes ex-ceeded 13 On this subject literature on machine-learning clearlyshows that the ratio of crisis and non-crisis should not be significantlyunbalanced According to Ertekin (2013) lsquo[i]n problems where theprevalence of classes is imbalanced it is necessary to prevent the resul-tant model from being skewed towards the majority (negative) classand to ensure that the model is capable of reflecting the true nature ofthe minority (positive) classrsquo Otherwise the generated model can sufferfrom an over-fitting problem For instance in the face of an extremelyskewed dataset (eg the real ratio of our data would be 35 crisis daysvs 6535 peace days) an automated machine-learning process naturallybecomes lsquogreedyrsquo that is in order to increase its overall accuracy it in-creasingly focuses on the larger category (6535 days) while virtually ig-noring the smaller category (35 days) Generating the model in thisway would be problematic however because our object is to focus onthe minority category of crisis days which is less than 05 of all daysfrom 1997 to 2013 After all we are trying to classify upcoming NorthKorean attacks not peaceful days

With this goal in mind we ran robustness checks to see if alternativemodels yield a better explanatory power when we increased the numberof peace days vis-a-vis threat days Because the maximum limit of anunbalanced ratio for automated machine-learning is 13 we attemptedboth 12 and 13 in our tests As Table 2 shows it is clear that the orig-inal model outperforms both alternatives Compared to the 12 ratiothe original model has more explanatory power in every way (iehigher overall accuracy higher threat accuracy higher non-threat accu-racy) By contrast the 13 ratio model has a slightly lower overall accu-racy marginally higher non-threat accuracy but devastatingly lowerthreat accuracy (175) than our original model because the increasingimbalance (from 710 to 13) turned the automated machine-learningprocess to a lsquogreedyrsquo mode As a result it is our conclusion that theoriginal model uses the golden ratio with the most accurate categoriz-ing capacity

20 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Second it is also necessary to check whether an out-of-sample ap-proach is a better alternative to our original model In our analysis wedeveloped a model by using 70 of data from the five North Koreanattacks and then applied it to the remaining 30 of data from Onemay suspect however that a better approach is to elaborate a modelfrom the first three North Korean strikes (ie during the Kim Jong-ilperiod) and then apply it to the remaining two attacks (ie during theKim Jong-un era) in order to see whether it yields more explanatorypower As Table 2 shows the opposite turns out to be the case In factthe out-of-sampling model loses explanatory power in every possibleway (ie lower overall accuracy lower threat accuracy and lower non-threat accuracy) As a result our original model outperforms the out-of-sampling alternative

Third it is necessary to check whether or not there is a leadershipchange effect in North Korea Has the change of power from KimJong-il to Kim Jong-un produced such significant differences in theirleadership style that it may be worth developing two separate models(one for the Kim Jong-il period the other for the Kim Jong-un era)instead of a single model covering the entire period as we did As

Table 2 Comparison with alternative models

ModelsOverall accuracy (N)

Threat accuracy(Sensitivity)

Non-threataccuracy (Specificity)

Rare event I 724 328 823

12 Ratio (368508) (66201) (302367)

Rare event II 788 175 99

13 Ratio (656832) (36206) (620626)

Out of sampling 541 145 821

KJI KJU (6471195) (72495) (575700)

NK leadership KJI 685 (98143) 459 (2861) 793 (6582)

KJI vs KJU KJU 595 (195328) 612 (93152) 58 (102176)

SK leadership Con 614 (221360) 363 (58160) 815 (163200)

Con vs Prog Prog 681 (98144) 493 (3571) 863 (6373)

No cheonan 669 514 787

Case (273408) 92178 181230

Our Model 823 503 961

(401487) (74147) (327340)

Detecting patterns in North Korean military provocations 21

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Table 2 shows the double platform alternative yields a worse outcomeoverall In fact its only advantage is that the Kim-Jong-un-model hasslightly higher threat accuracy (612) than our original model(503) In every other aspect however the Kim-Jong-un-model per-forms much worse Moreover the Kim-Jong-il-model has a much loweraccuracy in all respects than our original model As a result it is clearthat the double platform is a worse alternative to our single platformThe recent leadership change in Pyongyang has shown little effects asfar as its military provocations are concerned

Fourth it is important to investigate whether a leadership change inSouth Korea has any impact on North Korean military provocationsHas the oscillation of power between the lsquoprogressiversquo administrations(Kim Dae-jung and Roh Moo-hyun) and the lsquoconservativersquo regimes(Lee Myung-bak and Park Geun-hye) invoked different responses fromPyongyang If so we should expect a double platform (one model forthe conservative period vs the other model for the progressive era inSouth Korea) to outperform the original single platform As Table 2shows however the two separate models in the double-platform alter-native perform much worse than the original single platform in all cat-egories an overall accuracy threat accuracy and non-threat accuracyClearly the original model is a better choice It is shown above thatthe leadership change in Pyongyang has not produced significant policyshifts in its military provocations Likewise leadership changes inSouth Korea do not seem to have invoked major policy shifts inPyongyang as far as its military provocations are concerned

Finally it is worth examining an alternative model that excludes thesinking of the Cheonan naval ship case Unlike the remaining four at-tacks in our dataset the North Korean government has consistentlydenied its involvement in the Cheonan incident If Pyongyang had notreally sunk the Cheonan as it claimed it means that our original modelis performing below its full potential because it is built upon some er-roneous cases (ie the Cheonan incident) In that case we should ex-pect the no-Cheonan alternative to outperform our original model AsTable 2 shows however when we exclude all the data related to theCheonan case the resulting model loses much explanatory power com-pared to the original model Only in one category (a threat accuracy)the no-Cheonan-case alternative outperforms our original model buteven in that category the difference is ignorable (503 vs 514) In

22 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

all other categories the original model shows much stronger perfor-mances 823 vs 669 in an overall accuracy and 961 vs 787 ina non-threat accuracy Although Pyongyang has denied its involvementin the Cheonan incident the huge performance gap between the origi-nal model and the no-Cheonan alternative suggests otherwise

A different way of comparing performances of alternative models isshown in Figure 6 where the Receiver Operating Characteristic (ROC)is presented A ROC space is defined by a False Positive Rate (1 ndashSpecificity) in the x-axis and a True Positive Rate (Sensitivity) in the y-axis In a ROC space a theoretically perfect model (ie a combinationof 100 True Positive Rate with 0 False Positive Rate) appears in theupper left corner with the coordinates (0 1) In comparison a purelyrandom model such as a coin toss is found along a 45-degree diagonalline As a result good models (ie those performing better than ran-dom guesses) are found above the 45 degree line (especially close to they-axis) whereas bad models are located below it In Figure 6 the

Figure 6 Receiver operating characteristic (ROC) plot

Detecting patterns in North Korean military provocations 23

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

coordinates of our model (004 05) means that it has a False PositiveRate of 4 while its True Positive Rate is 50 By contrast the coordi-nates of alternative models in the ROC space are as follows (i) theRare Event I model with 12 ratio is (018 033) (ii) the Rare Event IImodel with 13 ratio is (001 018) (iii) the Out of Sampling (KJI KJU) model is (018 015) (iv) the Leadership Change (KJI-only)model is (021 046) (v) the Leadership Change (KJU-only) model is(042 061) (vi) the model for Conservative South Korean leadershipis (018 036) (vii) the model for Conservative South Korean leader-ship is (014 049) and (viii) the model that excludes the Cheonan caseis (021 051) As one can see all eight alternatives yield their coordi-nates either close to or lower than the diagonal line in the ROC spacewhereas our model lies above the diagonal line and close to the y-axisin the ROC space As a result we can conclude that the original modelperforms better than its potential rivals

4 Conclusion

In this article we studied articles published by the KCNA the officialnews outlet of North Korea in order to analyze patterns of its conven-tional military provocations To this end we have adopted a newmethod of automated text classification through supervised machinelearning Our model investigated the frequency of all terms appearingin KCNA articles immediately prior to five North Korean military at-tacks between 1997 and 2013 The frequency of these terms was thencompared with the frequency of terms appearing in KCNA articlespublished during peacetime without military provocations The com-parison brought to light a number of key terms ndash lsquoattack wordsrsquo so tospeak ndash whose appearances spiked in the KCNA prior to NorthKorean attacks Based on these terms our model correctly identifieseight of 10 articles as signs of imminent attacks or as peacetime newspieces

Specifically our model found five pattern-detecting terms of NorthKorean military threats lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquo lsquoJunersquo andlsquoJapanesersquo For a proper analysis of their meaning we went back to thearticles in which these five attack words appeared in order to investi-gate the contexts in which they were used by the North Korean gov-ernment Our investigation shows that in the lead-up to an attack

24 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Pyongyang displayed a strong tendency to emphasize the legacy of itsmilitary struggle against lsquoJapanesersquo colonialism and its fight against theUS imperialism during the Korean War (which began in lsquoJunersquo 1950)perhaps in an effort to heighten a domestic patriotic fervor at a timeof impending crisis In addition immediately before Pyongyang em-barks on hostile provocations the KCNA tends to increase its reprint-ing of lsquosignedrsquo commentaries from Rodong Sinmun which typically crit-icizes the United States South Korea and Japan in harsh termsprobably reflecting its deteriorating relations with the outside world

It seems that our methodology is promising to studies of interna-tional conflicts of various sorts First as shown in this article amachine-learning technology is useful in terms of detecting patternsthat distinguish threats of Pyongyang from its lsquonoisesrsquo or lsquobluffingrsquoSecond for other authoritarian regimes where there is a tight govern-ment controls over the mass media our approach can be used to detectpatterns symptoms or clues to escalating crisis Finally even for dem-ocratic countries with a free media our approach can be used on vari-ous occasions For instance a machine-learning technology can be uti-lized to analyze certain aspects of domestic politics such as signalingor communication between policymakers and the public To cite a con-crete example we can use a machine-learning technique to examinehow an aggressive foreign policy like lsquosabre rattlingrsquo affects public per-ceptions of fear or crisis in democracy Also a machine-learning ap-proach can be used to analyze external interactions of a democratic re-gime with other countries For instance we can test if there are anysignificant correlations between presidential elections in the US and thelevel of North Korean threats or between North Korean nuclear testsand varying public reactions in South Korea

As a final note it is not our contention that the model developed inthis article based on an automated machine-learning technique has de-tected intentional signals which Pyongyang sends to the outside worldimmediately prior to its military attack Instead the key attack wordswe have identified should be seen as signs or patterns that the NorthKorean regime unwittingly displays when it is inching toward a militaryoption For two reasons we doubt an intentional or signaling natureof the attack words in our model First if Pyongyang has indeed usedthe five attack words as a deliberate signal to the outside world that itis gearing up for military provocations why would it send the signal in

Detecting patterns in North Korean military provocations 25

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

such an arcane way that can be detected only through a complicatedautomated text classification process If the North Korean regime in-tends to send a signal to the outside world there is a better clearerand more visible option such as an Official Announcement by theMinistry of Foreign Affairs which Pyongyang has occasionally pub-lished in the pages of the KCNA on important issues Second if it hadbeen sending intentional signals of an impending military strike whywould the North Korean regime deny such an attack afterwards If re-peated an ex post denial would only reduce the credibility of an exante signal creating an image of North Korea as a casual bluffer Fortheir arcane nature and occasional ex post denial the five attack wordsin our model should not be treated as a premeditated signal fromPyongyang Instead they should be understood as inadvertent signs orpatterns unconsciously displayed by the North Korean governmentprior to a military attack In fact it is the unplanned nature of theseattack words that provides even more valuable information to the out-side world because it eliminates the possibility of feigned or false sig-nals from North Korea Like a pitcher who unknowingly flinches be-fore he throws a fast ball Pyongyang may unwittingly display certainpatterns before it launches a military strike

Acknowledgements

The publication of this article is supported by the National ResearchFoundation of Korea Grant funded by the Korean Government (NRF-2014S1A5A2A03065042 and NRF-2013S1A3A2055081) Previous ver-sions of this article were presented at the 2014 International StudiesAssociation Annual Convention (Toronto Canada) For questions re-garding the article please contact H Joo

References

Breslow NE (1196) lsquoStatistics in epidemiology the case-control studyrsquoJournal of American Statistical Association 91 433 14ndash28

Cha V (2010) The Sinking of the Cheonan Center for Strategic ampInternational Studies (2010 April 22) httpcsisorgpublicationsinking-cheonan (24 July 2014 date last accessed)

26 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Choe S-H (2009) Korean Navies Skirmish in DisputedWaters New YorkTimes (2009 November 10) httpwwwnytimescom20091111worldasia11koreahtml_rfrac140 (23 June 2014 date last accessed)

Ertekin S (2013) lsquoAdaptive oversampling for imbalanced data classificationrsquoin G Erol amp L Ricard (ed) Information Sciences and Systems pp261ndash269 New York Springer

Global Security (2002) The naval clash on the yellow sea on 29 June 2002 be-tween South and North Korea the situation and ROKrsquos position (2002July 1) Global Security httpwwwglobalsecurityorgwmdlibrarynewsdprk2002dprk-020701-1htm (22 July 2014 date last accessed)

Hong S (2012) lsquoNorth Korearsquos capability to conduct provocations and ROK-US capability to counter themrsquo Korea Association of Defense IndustryStudies 19 2 135ndash136

Hopkins D and King G (2010) lsquoExtracting systemic social science meaningfrom textrsquo American Journal of Political Science 54 1 229ndash47

Jung S-C (2013) Internal Instability and External Provocation Seoul KINU

Jung S-Y (2008) lsquoA study on the origin of the pueblo incident on 1968rsquoKukjejongchiyonrsquogu 11 2 179ndash207

Jurafsky D and Martin J (2009) Speech and Natural Language ProcessingAn Introduction to Natural Language Processing Computational Linguisticsand Speech Recognition NJ Prentice Hall

Kang C-G (2013) lsquoA laboratory for recursive partyrsquo Hangukcontentshakhoe11 4 23ndash32

King G and Zeng L (2001a) lsquoLogistic regression in rare events datarsquoPolitical Analysis 9 2 137ndash163

King G and Zeng L (2001b) lsquoExplaining rare events in InternationalRelationsrsquo International Organization 55 3 693ndash715

Ko M-K (2015) lsquoNorth Korean military adventurism in the late 1960s andthe changes of party-military relationsrsquo Hyondaibukhanyongu 18 3 7ndash58

Lee W-K (2014) lsquoA case study on the provocations by NKrsquo Kunsaji 91 663ndash110

Litwak R (2007) Regime Change Washington DC Johns HopkinsUniversity Press

Macfie N (2013) The Battles of the Korean West Sea (2010 November 29)Reutershttpwwwreuterscomarticle20101129us-korea-north-clashes-idUSTRE6AS1AL20101129 (5 August 2014 date last accessed)

Mearsheimer J J (2001) The Tragedy of Great Power Politics New YorkWW Norton amp Company

Moore M and Hutchison P (2010) Yeonpyeong Island A History(2010November 23) The Telegraph httpwwwtelegraphcouknewsworldnews

Detecting patterns in North Korean military provocations 27

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

asiasouthkorea8155486Yeonpyeong-Island-A-historyhtml (7 August 2014date last accessed)

Oh I-W (2011) lsquoSouth Korearsquos countermeasures against North KorearsquosArmed Provocation and Offensive dialogue proposalrsquo Trsquoongiljyollyak 111 227ndash266

Rich T (2012a) lsquoLike father like son Correlates of leaderhip in NorthKorearsquos english language newsrsquo Korea Observer 43 4 649ndash674

Rich T (2012b) lsquoDeciphering North Korearsquos nuclear rhetoric an automatedcontent analysis of KCNA newsrsquo Asian Affairs An American Review 3973ndash89

Sigal L (1998) Disarming Strangers Nuclear Diplomacy with North KoreaPrinceton Princeton University Press

Sohn J-Y (2002) South North Korea clash at sea CNN (2002 June 29)httpeditioncnncom2002WORLDasiapcfeast0629koreawarships (21July 2014 date last accessed)

Sudworth J (2010) How South Korean Ship was Sunk BBC News (2010May 20) httpwwwbbccouknews10130909 (4 August 2014 date lastaccessed)

28 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

  • lcw016-FN1
  • lcw016-FN2
  • lcw016-FN3
  • lcw016-FN4
Page 13: Detecting patterns in North Korean military provocations ...idse.or.kr/file/Whang_1.pdf · that North Korea may have because of increasing power asymmetry since the collapse of the

were published within a week of a North Korea military provocationAs a result a sudden spike in the use of the word lsquosignedrsquo (withoutlsquobakrsquo) indicates that a North Korean military provocation is around thecorner In this respect however there is a further condition that mustbe satisfied When lsquosignedrsquo and lsquobakrsquo appear together in the same arti-cle they indicate a pattern of a peaceful situation instead As a resultthe pattern-detecting role of the term lsquosignedrsquo appears to be contingentupon the presence or absence of another term lsquobakrsquo

According to our model the third index indicating that Pyongyangmay be preparing for a military operation is the term lsquoassemblyrsquo Infact all the KCNA articles that included the word lsquoassemblyrsquo withoutlsquoyearsrsquo or lsquosignedrsquo (22 articles in the training dataset) were publishedwithin one week of a North Korean military attack As a result a sud-den increase in the frequency of the word lsquoassemblyrsquo may be a goodpattern indicator that a North Korean military strike is on the horizoneven in the absence of stronger threat terms such as lsquoyearsrsquo andlsquosignedrsquo

The fourth indicator of a North Korean military threat is the termlsquoJunersquo Unlike other indicators however the role of lsquoJunersquo is condi-tional In the absence of other stronger signs (ie years signed and as-sembly) the term lsquoJunersquo can play the role of a significant indicator of aNorth Korean military attack only if another term lsquoreunificationrsquo ap-pears once or less in the same article To our surprise however theterm lsquoJunersquo turns into a strong indicator of a non-threat situation if theword lsquoreunificationrsquo appears more than twice in the same article As aresult it seems that the word lsquoJunersquo implies an increasing NorthKorean threat only if it is not closely associated with another key pat-tern indicator lsquoreunificationrsquo

The last index of a North Korean military provocation is the wordlsquoJapanesersquo When other (and stronger) terms such as lsquoyearsrsquo lsquosignedrsquolsquoassemblyrsquo and lsquoJunersquo were not present all the KCNA articles that in-cluded the term lsquoJapanesersquo (17 articles in the training dataset) appearedwithin one week of a North Korean military provocation As a resultthe use of the term lsquoJapanesersquo serves as a good pattern indication thatin the absence of other attack words Pyongyang is moving dangerouslyclose to a military strike

Detecting patterns in North Korean military provocations 13

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

22 Testing the model

There are 904 KCNA articles that do not include any of the indicatorsdiscussed above Because they do not contain any pattern indicator ofa North Korean attack our model identifies them as non-threats Inreality however about 20 of these articles were published within oneweek of the five military provocations by Pyongyang As a result ourmodel has roughly 80 accuracy in identifying North Korean militarythreats What is encouraging is that our model has identified five keypattern indicators of increasing North Korean threats When these keyterms are put together under certain conditions they correctly identify80 of articles in the training dataset as threat articles or non-threatitems Put differently the model can accurately classify in 8 of 10 caseswhether various messages from Pyongyang are real threats or justrhetoric

Although 80 accuracy is impressive it is too early to be optimisticAfter all 80 accuracy was achieved with the training dataset fromwhich our model was originally derived If we were impressed by its80 accuracy rate we would resemble a case study specialist who de-veloped an elaborate theory from a few cases and then applied it backto those original cases only to be impressed by how accurate hishertheory was The real test consists of applying our model to cases otherthan those from which it is derived This is the reason why we dividedthe entire KCNA data into two subsets the training dataset fromwhich our model is developed and the test dataset to which it will beapplied as a real test

Table 1 shows the result of our model against the test datasetNumbers in the cells indicate whether there was agreement or discrep-ancy between model predictions and actual cases For example our

Table 1 Model accuracy against test dataset

ActualModel Threat Non-threat

Threat 74 13 Positive predictive value 085

Non-threat 73 327 Negative predictive value 082

Sensitivity Specificity Overall accuracy

050 096 082

14 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

model classified 327 cases (the lower right cell) as non-threat and thosearticles were in fact published during non-threat weeks In additionthe model classified 74 articles (the upper left cell) as threats and thosearticles were actually published during threat weeks When we applyour model to the test dataset its overall accuracy turns out 823roughly the same level we obtained with the training dataset Althoughour model is deduced from the training dataset it does not lose cate-gorizing capacity when applied to the test dataset Specifically of 487articles in the test dataset our model correctly classified 401 articles(823) as either threats or non-threats The model is very effective inidentifying harmless rhetoric from Pyongyang that is messages not re-sulting in actual military attacks When applied to a total of 340 non-threat articles in the test dataset our model successfully classified 327items as noise with only 13 misses (ie 967 accuracy) As a result itis very effective at filtering noise from Pyongyang By contrast thepattern-detecting accuracy of our model drops somewhat with respectto threats Of 147 threat articles in the test dataset it correctly identi-fies 74 items as threats while misrecognizing 73 threats as peaceful arti-cles (ie 50 accuracy)

3 Discussion

What does the model tell us in plain terms In particular what are themeanings of the key indicators for a North Korean military threat (ieyears signed assembly June and Japanese) To answer it is necessaryto go back to the original KCNA documents in order to understandthe contexts in which these terms were used A close reading of theKCNA articles that included the five key indicators or lsquoattack wordsrsquoreveals several interesting patterns

First the North Korean regime tends to emphasize its history ofmilitary struggle against foreign enemies before it launches armed prov-ocations It is then understandable that key terms such as lsquoyearsrsquo andlsquoJapanesersquo often appear in KCNA articles immediately before a militarystrike Prior to the second naval clash of Yonpyong on 29 June 2002for instance the North Korean government published a series of arti-cles in which it boasted of the lsquoyearsrsquo of North Korearsquos victoriousstruggles against foreign enemies including lsquoJapanesersquo colonialism(1910ndash45) It is possible that Pyongyang invoked the military legacy of

Detecting patterns in North Korean military provocations 15

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

its supreme leader Kim Il-sung in anticipation of an impending con-flict either to boost the morale of its people or to show the outsideworld that it was determined not to back down

Second the most perplexing term in Figure 4 is lsquoJunersquo because itcan be indicative of both threat and peace Specifically the word lsquoJunersquois an indicator of peace when associated with lsquoreunificationrsquo but it canalso be a sign of a military threat when it does not appear in conjunc-tion with lsquoreunificationrsquo A careful reading of the KCNA articles thatinclude the term lsquoJunersquo suggests an explanation for this strange phe-nomenon It turns out that the word lsquoJunersquo can refer to either theKorean War (which broke out on 25 June 1950) or the historic summitbetween Kim Dae-jung and Kim Il-sung on 15 June 2000 (which waspraised as a major cornerstone for future lsquoreunificationrsquo in the officialNorth Korean press) As a result when the word lsquoJunersquo appears inclose association with lsquoreunificationrsquo it refers to the 2000 summit (orlsquothe June 15 Summitrsquo as it is called in North Korea) thus indicatingthe peaceful intentions of Pyongyang By contrast when the wordlsquoJunersquo appears alone without connection to lsquoreunificationrsquo it refers tothe Korean War thus conveying a more hostile mood

Third it is also interesting to note that the North Korean govern-ment tends to publish many lsquosignedrsquo commentaries from RodongSinmun in the KCNA immediately before it launches military provoca-tions In these lsquosignedrsquo commentaries of Rodong Sinmun Pyongyangusually criticizes foreign countries (especially the United States Japanand South Korea) in very harsh terms on various issues Accordinglythe third term lsquosignedrsquo seems to correspond to the fact that RodongSinmun the official mouthpiece of the North Korean regime barksvery loudly before it actually bites its intended target A sudden in-crease in the number of lsquosignedrsquo commentaries of Rodong Sinmun inKCNA articles appears to be a reliable sign that the North Koreangovernment may be turning to an attack mode

Finally the last indicator of an imminent threat lsquoassemblyrsquo is theeasiest term to identify but the hardest one to interpret As shown inFigure 5 the term lsquoassemblyrsquo is primarily used in reference to the SPABecause the SPA is the highest North Korean authority with regard toits relations with foreign countries it is tempting to interpret a suddenincrease in references to the SPA prior to military provocations as indi-cating that Pyongyang is attempting to openly signal its hostile

16 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

intentions to the outside world In other words belligerent messagesemanating from the SPA (and reported in KCNA articles) may revealthe determination of the North Korean regime

Although plausible the problem with such an interpretation is thata close reading of several KCNA articles using the term lsquoassemblyrsquoshows that messages from the SPA are often not dire at all For exam-ple consider the following article published on 17 November 2010 justdays before the North Korean shelling at Yonpyong Island lsquoKim YongNam President of the Presidium of the DPRK SPA sent a message ofgreetings to Qaboos Bin Said Sultan of Oman on Wednesday on theoccasion of its national holidayrsquo (KCNA 17 November 2010) As thisexample illustrates a typical KCNA article with the word lsquoassemblyrsquodescribes rather routine business of the SPA such as how it sent a mes-sage of congratulations to a foreign country how it greeted visitingdignitaries from foreign countries and so on Clearly such mundanemessages cannot be a meaningful sign of an imminent North Koreanmilitary provocation While the publication of these articles (containingthe term lsquoassemblyrsquo) might possibly correspond to an uptick inPyongyangrsquos diplomatic efforts to strengthen its ties with foreign coun-tries and increase international support prior to a military attack weneed further analyses to substantiate such a hypothesis At the sametime it also seems clear from the test that the term lsquoassemblyrsquo does not

Figure 5 Social network analysis for key words in KCNA articles

Detecting patterns in North Korean military provocations 17

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

appear randomly and is somehow linked to the timing of NorthKorean military provocations Further research is necessary to make aproper interpretation of this seemingly irrelevant yet apparently signifi-cant indicator of North Korean military threats

31 Robustness checks

Before we conclude it is necessary to address five issues regarding therobustness of our findings (i) the rare-event issue (that is due to therare nature of a North Korean military attack our sampling ratio of710 should be justified) (ii) an out-of-sample alternative (that is in-stead of building our model by using 70 of data from the five NorthKorean military strikes and then applying it to the remaining 30 ofthe data it may be better to develop a model from the first threestrikes and then apply it to the remaining two North Korean provoca-tions) and (iii) a North Korean leadership change effect (that is in-stead of elaborating a single model covering 1997 to 2013 it may makesense to develop two separate models [one for the Kim Jong-il periodand the other for the Kim Jong-un era] in order to check whether thereis a leadership change effect) (iv) a South Korean leadership changeeffect (that is it may be better to elaborate two different models [onefor the lsquoconservativersquo period during Lee Myung-bak and Park Geun-hye and the other for the lsquoprogressiversquo era during Kim Dae-jung andRoh Moo-hyun] in order to investigate whether North Korea re-sponded differently to leadership changes in South Korea and (v) ano-Cheonan alternative (that is it is worth elaborating a model whileexcluding the Cheonan naval ship case because unlike the remainingfour attacks Pyongyang has denied its involvement in sinking theCheonan In this section these five issues are addressed to check therobustness of our model

First there is a discrepancy in the ratio of threat days vs non-threatdays between the data used in our analysis and the actual frequencyAs explained earlier we have defined the threat articles as those pub-lished one week prior to each actual crisis whereas the non-threatitems are defined as those chosen randomly for a 10-day period at leasttwo months before or after actual provocations For all five crises theratio of our data is then 3550 (in days) or 710 In reality however thenumber of peace days (ie 18 years minus 35 days) is much larger than

18 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

the number of crisis days (35 days) When we count all of the peacedays that are not included in our data (ie when we take all TrueNegatives into our data) the North Korean provocations become ex-tremely rare events As a result a possible criticism is that our ap-proach does not represent the true data generating process in a sam-pling procedure While military provocations occur rarely in realityour sampling procedure exaggerates their likelihood by significantly re-ducing the number of peace days in the data Put more technically ourmodel reduces the number of False Positives while increasing the num-ber of False Negatives

Despite the criticism our sampling strategy can be justified for tworeasons First the rare-event problem is not new in political scienceFor example King and Zeng (2001ab) addressed the same problem instatistical analysis (eg logit analysis with rare events) A commonlysuggested solution is the lsquochoice-based or endogenous stratified sam-plingrsquo in econometrics and the lsquocase-control designrsquo in epidemiology(Breslow 1996) The idea is to lsquoselect within categories of Yrsquo such thatall crisis days are sampled while we use a small random fraction ofnon-crisis days In this respect our sampling strategy is consistent withsuch an approach in that we have chosen all crisis days (35) and a ran-dom fraction of peace days (50) Not surprisingly such an approach iscommonly used in supervised machine-learning as well The rare-eventissue also known as a class imbalance problem is an ongoing issue insupervised machine-learning because the size of one class is often muchlarger than that of the other class (eg premature births violent civilconflicts fraudulent credit card transactions etc) In supervisedmachine-learning two solutions have been suggested a data samplingtechnique and an algorithmic modification technique Whereas datasampling is to under-sample the majority class while over-sampling theminority class a modification technique is to combine both over- andunder-sampling methods at the same time In this article we haveadopted a data sampling technique in order to address the class imbal-ance between threat days (a minority class) and non-threat days (a ma-jority class) in North Korean military provocations

Second although reducing the number of non-crisis days in ourdata (under-sampling) while using all the crisis days (over-sampling) isconsistent with both statistical and machine-learning literature therestill remains a question How far should we reduce the number of

Detecting patterns in North Korean military provocations 19

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

peace days Will a ratio of 12 13 or an even more skewed ratio gen-erate a better performance than our model using a 710 ratio To an-swer this question we pushed the ratio until we saw a clear sign of themodel becoming too skewed towards the majority class (peace days)In our case it occurred when the ratio between the two classes ex-ceeded 13 On this subject literature on machine-learning clearlyshows that the ratio of crisis and non-crisis should not be significantlyunbalanced According to Ertekin (2013) lsquo[i]n problems where theprevalence of classes is imbalanced it is necessary to prevent the resul-tant model from being skewed towards the majority (negative) classand to ensure that the model is capable of reflecting the true nature ofthe minority (positive) classrsquo Otherwise the generated model can sufferfrom an over-fitting problem For instance in the face of an extremelyskewed dataset (eg the real ratio of our data would be 35 crisis daysvs 6535 peace days) an automated machine-learning process naturallybecomes lsquogreedyrsquo that is in order to increase its overall accuracy it in-creasingly focuses on the larger category (6535 days) while virtually ig-noring the smaller category (35 days) Generating the model in thisway would be problematic however because our object is to focus onthe minority category of crisis days which is less than 05 of all daysfrom 1997 to 2013 After all we are trying to classify upcoming NorthKorean attacks not peaceful days

With this goal in mind we ran robustness checks to see if alternativemodels yield a better explanatory power when we increased the numberof peace days vis-a-vis threat days Because the maximum limit of anunbalanced ratio for automated machine-learning is 13 we attemptedboth 12 and 13 in our tests As Table 2 shows it is clear that the orig-inal model outperforms both alternatives Compared to the 12 ratiothe original model has more explanatory power in every way (iehigher overall accuracy higher threat accuracy higher non-threat accu-racy) By contrast the 13 ratio model has a slightly lower overall accu-racy marginally higher non-threat accuracy but devastatingly lowerthreat accuracy (175) than our original model because the increasingimbalance (from 710 to 13) turned the automated machine-learningprocess to a lsquogreedyrsquo mode As a result it is our conclusion that theoriginal model uses the golden ratio with the most accurate categoriz-ing capacity

20 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Second it is also necessary to check whether an out-of-sample ap-proach is a better alternative to our original model In our analysis wedeveloped a model by using 70 of data from the five North Koreanattacks and then applied it to the remaining 30 of data from Onemay suspect however that a better approach is to elaborate a modelfrom the first three North Korean strikes (ie during the Kim Jong-ilperiod) and then apply it to the remaining two attacks (ie during theKim Jong-un era) in order to see whether it yields more explanatorypower As Table 2 shows the opposite turns out to be the case In factthe out-of-sampling model loses explanatory power in every possibleway (ie lower overall accuracy lower threat accuracy and lower non-threat accuracy) As a result our original model outperforms the out-of-sampling alternative

Third it is necessary to check whether or not there is a leadershipchange effect in North Korea Has the change of power from KimJong-il to Kim Jong-un produced such significant differences in theirleadership style that it may be worth developing two separate models(one for the Kim Jong-il period the other for the Kim Jong-un era)instead of a single model covering the entire period as we did As

Table 2 Comparison with alternative models

ModelsOverall accuracy (N)

Threat accuracy(Sensitivity)

Non-threataccuracy (Specificity)

Rare event I 724 328 823

12 Ratio (368508) (66201) (302367)

Rare event II 788 175 99

13 Ratio (656832) (36206) (620626)

Out of sampling 541 145 821

KJI KJU (6471195) (72495) (575700)

NK leadership KJI 685 (98143) 459 (2861) 793 (6582)

KJI vs KJU KJU 595 (195328) 612 (93152) 58 (102176)

SK leadership Con 614 (221360) 363 (58160) 815 (163200)

Con vs Prog Prog 681 (98144) 493 (3571) 863 (6373)

No cheonan 669 514 787

Case (273408) 92178 181230

Our Model 823 503 961

(401487) (74147) (327340)

Detecting patterns in North Korean military provocations 21

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Table 2 shows the double platform alternative yields a worse outcomeoverall In fact its only advantage is that the Kim-Jong-un-model hasslightly higher threat accuracy (612) than our original model(503) In every other aspect however the Kim-Jong-un-model per-forms much worse Moreover the Kim-Jong-il-model has a much loweraccuracy in all respects than our original model As a result it is clearthat the double platform is a worse alternative to our single platformThe recent leadership change in Pyongyang has shown little effects asfar as its military provocations are concerned

Fourth it is important to investigate whether a leadership change inSouth Korea has any impact on North Korean military provocationsHas the oscillation of power between the lsquoprogressiversquo administrations(Kim Dae-jung and Roh Moo-hyun) and the lsquoconservativersquo regimes(Lee Myung-bak and Park Geun-hye) invoked different responses fromPyongyang If so we should expect a double platform (one model forthe conservative period vs the other model for the progressive era inSouth Korea) to outperform the original single platform As Table 2shows however the two separate models in the double-platform alter-native perform much worse than the original single platform in all cat-egories an overall accuracy threat accuracy and non-threat accuracyClearly the original model is a better choice It is shown above thatthe leadership change in Pyongyang has not produced significant policyshifts in its military provocations Likewise leadership changes inSouth Korea do not seem to have invoked major policy shifts inPyongyang as far as its military provocations are concerned

Finally it is worth examining an alternative model that excludes thesinking of the Cheonan naval ship case Unlike the remaining four at-tacks in our dataset the North Korean government has consistentlydenied its involvement in the Cheonan incident If Pyongyang had notreally sunk the Cheonan as it claimed it means that our original modelis performing below its full potential because it is built upon some er-roneous cases (ie the Cheonan incident) In that case we should ex-pect the no-Cheonan alternative to outperform our original model AsTable 2 shows however when we exclude all the data related to theCheonan case the resulting model loses much explanatory power com-pared to the original model Only in one category (a threat accuracy)the no-Cheonan-case alternative outperforms our original model buteven in that category the difference is ignorable (503 vs 514) In

22 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

all other categories the original model shows much stronger perfor-mances 823 vs 669 in an overall accuracy and 961 vs 787 ina non-threat accuracy Although Pyongyang has denied its involvementin the Cheonan incident the huge performance gap between the origi-nal model and the no-Cheonan alternative suggests otherwise

A different way of comparing performances of alternative models isshown in Figure 6 where the Receiver Operating Characteristic (ROC)is presented A ROC space is defined by a False Positive Rate (1 ndashSpecificity) in the x-axis and a True Positive Rate (Sensitivity) in the y-axis In a ROC space a theoretically perfect model (ie a combinationof 100 True Positive Rate with 0 False Positive Rate) appears in theupper left corner with the coordinates (0 1) In comparison a purelyrandom model such as a coin toss is found along a 45-degree diagonalline As a result good models (ie those performing better than ran-dom guesses) are found above the 45 degree line (especially close to they-axis) whereas bad models are located below it In Figure 6 the

Figure 6 Receiver operating characteristic (ROC) plot

Detecting patterns in North Korean military provocations 23

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

coordinates of our model (004 05) means that it has a False PositiveRate of 4 while its True Positive Rate is 50 By contrast the coordi-nates of alternative models in the ROC space are as follows (i) theRare Event I model with 12 ratio is (018 033) (ii) the Rare Event IImodel with 13 ratio is (001 018) (iii) the Out of Sampling (KJI KJU) model is (018 015) (iv) the Leadership Change (KJI-only)model is (021 046) (v) the Leadership Change (KJU-only) model is(042 061) (vi) the model for Conservative South Korean leadershipis (018 036) (vii) the model for Conservative South Korean leader-ship is (014 049) and (viii) the model that excludes the Cheonan caseis (021 051) As one can see all eight alternatives yield their coordi-nates either close to or lower than the diagonal line in the ROC spacewhereas our model lies above the diagonal line and close to the y-axisin the ROC space As a result we can conclude that the original modelperforms better than its potential rivals

4 Conclusion

In this article we studied articles published by the KCNA the officialnews outlet of North Korea in order to analyze patterns of its conven-tional military provocations To this end we have adopted a newmethod of automated text classification through supervised machinelearning Our model investigated the frequency of all terms appearingin KCNA articles immediately prior to five North Korean military at-tacks between 1997 and 2013 The frequency of these terms was thencompared with the frequency of terms appearing in KCNA articlespublished during peacetime without military provocations The com-parison brought to light a number of key terms ndash lsquoattack wordsrsquo so tospeak ndash whose appearances spiked in the KCNA prior to NorthKorean attacks Based on these terms our model correctly identifieseight of 10 articles as signs of imminent attacks or as peacetime newspieces

Specifically our model found five pattern-detecting terms of NorthKorean military threats lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquo lsquoJunersquo andlsquoJapanesersquo For a proper analysis of their meaning we went back to thearticles in which these five attack words appeared in order to investi-gate the contexts in which they were used by the North Korean gov-ernment Our investigation shows that in the lead-up to an attack

24 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Pyongyang displayed a strong tendency to emphasize the legacy of itsmilitary struggle against lsquoJapanesersquo colonialism and its fight against theUS imperialism during the Korean War (which began in lsquoJunersquo 1950)perhaps in an effort to heighten a domestic patriotic fervor at a timeof impending crisis In addition immediately before Pyongyang em-barks on hostile provocations the KCNA tends to increase its reprint-ing of lsquosignedrsquo commentaries from Rodong Sinmun which typically crit-icizes the United States South Korea and Japan in harsh termsprobably reflecting its deteriorating relations with the outside world

It seems that our methodology is promising to studies of interna-tional conflicts of various sorts First as shown in this article amachine-learning technology is useful in terms of detecting patternsthat distinguish threats of Pyongyang from its lsquonoisesrsquo or lsquobluffingrsquoSecond for other authoritarian regimes where there is a tight govern-ment controls over the mass media our approach can be used to detectpatterns symptoms or clues to escalating crisis Finally even for dem-ocratic countries with a free media our approach can be used on vari-ous occasions For instance a machine-learning technology can be uti-lized to analyze certain aspects of domestic politics such as signalingor communication between policymakers and the public To cite a con-crete example we can use a machine-learning technique to examinehow an aggressive foreign policy like lsquosabre rattlingrsquo affects public per-ceptions of fear or crisis in democracy Also a machine-learning ap-proach can be used to analyze external interactions of a democratic re-gime with other countries For instance we can test if there are anysignificant correlations between presidential elections in the US and thelevel of North Korean threats or between North Korean nuclear testsand varying public reactions in South Korea

As a final note it is not our contention that the model developed inthis article based on an automated machine-learning technique has de-tected intentional signals which Pyongyang sends to the outside worldimmediately prior to its military attack Instead the key attack wordswe have identified should be seen as signs or patterns that the NorthKorean regime unwittingly displays when it is inching toward a militaryoption For two reasons we doubt an intentional or signaling natureof the attack words in our model First if Pyongyang has indeed usedthe five attack words as a deliberate signal to the outside world that itis gearing up for military provocations why would it send the signal in

Detecting patterns in North Korean military provocations 25

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

such an arcane way that can be detected only through a complicatedautomated text classification process If the North Korean regime in-tends to send a signal to the outside world there is a better clearerand more visible option such as an Official Announcement by theMinistry of Foreign Affairs which Pyongyang has occasionally pub-lished in the pages of the KCNA on important issues Second if it hadbeen sending intentional signals of an impending military strike whywould the North Korean regime deny such an attack afterwards If re-peated an ex post denial would only reduce the credibility of an exante signal creating an image of North Korea as a casual bluffer Fortheir arcane nature and occasional ex post denial the five attack wordsin our model should not be treated as a premeditated signal fromPyongyang Instead they should be understood as inadvertent signs orpatterns unconsciously displayed by the North Korean governmentprior to a military attack In fact it is the unplanned nature of theseattack words that provides even more valuable information to the out-side world because it eliminates the possibility of feigned or false sig-nals from North Korea Like a pitcher who unknowingly flinches be-fore he throws a fast ball Pyongyang may unwittingly display certainpatterns before it launches a military strike

Acknowledgements

The publication of this article is supported by the National ResearchFoundation of Korea Grant funded by the Korean Government (NRF-2014S1A5A2A03065042 and NRF-2013S1A3A2055081) Previous ver-sions of this article were presented at the 2014 International StudiesAssociation Annual Convention (Toronto Canada) For questions re-garding the article please contact H Joo

References

Breslow NE (1196) lsquoStatistics in epidemiology the case-control studyrsquoJournal of American Statistical Association 91 433 14ndash28

Cha V (2010) The Sinking of the Cheonan Center for Strategic ampInternational Studies (2010 April 22) httpcsisorgpublicationsinking-cheonan (24 July 2014 date last accessed)

26 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Choe S-H (2009) Korean Navies Skirmish in DisputedWaters New YorkTimes (2009 November 10) httpwwwnytimescom20091111worldasia11koreahtml_rfrac140 (23 June 2014 date last accessed)

Ertekin S (2013) lsquoAdaptive oversampling for imbalanced data classificationrsquoin G Erol amp L Ricard (ed) Information Sciences and Systems pp261ndash269 New York Springer

Global Security (2002) The naval clash on the yellow sea on 29 June 2002 be-tween South and North Korea the situation and ROKrsquos position (2002July 1) Global Security httpwwwglobalsecurityorgwmdlibrarynewsdprk2002dprk-020701-1htm (22 July 2014 date last accessed)

Hong S (2012) lsquoNorth Korearsquos capability to conduct provocations and ROK-US capability to counter themrsquo Korea Association of Defense IndustryStudies 19 2 135ndash136

Hopkins D and King G (2010) lsquoExtracting systemic social science meaningfrom textrsquo American Journal of Political Science 54 1 229ndash47

Jung S-C (2013) Internal Instability and External Provocation Seoul KINU

Jung S-Y (2008) lsquoA study on the origin of the pueblo incident on 1968rsquoKukjejongchiyonrsquogu 11 2 179ndash207

Jurafsky D and Martin J (2009) Speech and Natural Language ProcessingAn Introduction to Natural Language Processing Computational Linguisticsand Speech Recognition NJ Prentice Hall

Kang C-G (2013) lsquoA laboratory for recursive partyrsquo Hangukcontentshakhoe11 4 23ndash32

King G and Zeng L (2001a) lsquoLogistic regression in rare events datarsquoPolitical Analysis 9 2 137ndash163

King G and Zeng L (2001b) lsquoExplaining rare events in InternationalRelationsrsquo International Organization 55 3 693ndash715

Ko M-K (2015) lsquoNorth Korean military adventurism in the late 1960s andthe changes of party-military relationsrsquo Hyondaibukhanyongu 18 3 7ndash58

Lee W-K (2014) lsquoA case study on the provocations by NKrsquo Kunsaji 91 663ndash110

Litwak R (2007) Regime Change Washington DC Johns HopkinsUniversity Press

Macfie N (2013) The Battles of the Korean West Sea (2010 November 29)Reutershttpwwwreuterscomarticle20101129us-korea-north-clashes-idUSTRE6AS1AL20101129 (5 August 2014 date last accessed)

Mearsheimer J J (2001) The Tragedy of Great Power Politics New YorkWW Norton amp Company

Moore M and Hutchison P (2010) Yeonpyeong Island A History(2010November 23) The Telegraph httpwwwtelegraphcouknewsworldnews

Detecting patterns in North Korean military provocations 27

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

asiasouthkorea8155486Yeonpyeong-Island-A-historyhtml (7 August 2014date last accessed)

Oh I-W (2011) lsquoSouth Korearsquos countermeasures against North KorearsquosArmed Provocation and Offensive dialogue proposalrsquo Trsquoongiljyollyak 111 227ndash266

Rich T (2012a) lsquoLike father like son Correlates of leaderhip in NorthKorearsquos english language newsrsquo Korea Observer 43 4 649ndash674

Rich T (2012b) lsquoDeciphering North Korearsquos nuclear rhetoric an automatedcontent analysis of KCNA newsrsquo Asian Affairs An American Review 3973ndash89

Sigal L (1998) Disarming Strangers Nuclear Diplomacy with North KoreaPrinceton Princeton University Press

Sohn J-Y (2002) South North Korea clash at sea CNN (2002 June 29)httpeditioncnncom2002WORLDasiapcfeast0629koreawarships (21July 2014 date last accessed)

Sudworth J (2010) How South Korean Ship was Sunk BBC News (2010May 20) httpwwwbbccouknews10130909 (4 August 2014 date lastaccessed)

28 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

  • lcw016-FN1
  • lcw016-FN2
  • lcw016-FN3
  • lcw016-FN4
Page 14: Detecting patterns in North Korean military provocations ...idse.or.kr/file/Whang_1.pdf · that North Korea may have because of increasing power asymmetry since the collapse of the

22 Testing the model

There are 904 KCNA articles that do not include any of the indicatorsdiscussed above Because they do not contain any pattern indicator ofa North Korean attack our model identifies them as non-threats Inreality however about 20 of these articles were published within oneweek of the five military provocations by Pyongyang As a result ourmodel has roughly 80 accuracy in identifying North Korean militarythreats What is encouraging is that our model has identified five keypattern indicators of increasing North Korean threats When these keyterms are put together under certain conditions they correctly identify80 of articles in the training dataset as threat articles or non-threatitems Put differently the model can accurately classify in 8 of 10 caseswhether various messages from Pyongyang are real threats or justrhetoric

Although 80 accuracy is impressive it is too early to be optimisticAfter all 80 accuracy was achieved with the training dataset fromwhich our model was originally derived If we were impressed by its80 accuracy rate we would resemble a case study specialist who de-veloped an elaborate theory from a few cases and then applied it backto those original cases only to be impressed by how accurate hishertheory was The real test consists of applying our model to cases otherthan those from which it is derived This is the reason why we dividedthe entire KCNA data into two subsets the training dataset fromwhich our model is developed and the test dataset to which it will beapplied as a real test

Table 1 shows the result of our model against the test datasetNumbers in the cells indicate whether there was agreement or discrep-ancy between model predictions and actual cases For example our

Table 1 Model accuracy against test dataset

ActualModel Threat Non-threat

Threat 74 13 Positive predictive value 085

Non-threat 73 327 Negative predictive value 082

Sensitivity Specificity Overall accuracy

050 096 082

14 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

model classified 327 cases (the lower right cell) as non-threat and thosearticles were in fact published during non-threat weeks In additionthe model classified 74 articles (the upper left cell) as threats and thosearticles were actually published during threat weeks When we applyour model to the test dataset its overall accuracy turns out 823roughly the same level we obtained with the training dataset Althoughour model is deduced from the training dataset it does not lose cate-gorizing capacity when applied to the test dataset Specifically of 487articles in the test dataset our model correctly classified 401 articles(823) as either threats or non-threats The model is very effective inidentifying harmless rhetoric from Pyongyang that is messages not re-sulting in actual military attacks When applied to a total of 340 non-threat articles in the test dataset our model successfully classified 327items as noise with only 13 misses (ie 967 accuracy) As a result itis very effective at filtering noise from Pyongyang By contrast thepattern-detecting accuracy of our model drops somewhat with respectto threats Of 147 threat articles in the test dataset it correctly identi-fies 74 items as threats while misrecognizing 73 threats as peaceful arti-cles (ie 50 accuracy)

3 Discussion

What does the model tell us in plain terms In particular what are themeanings of the key indicators for a North Korean military threat (ieyears signed assembly June and Japanese) To answer it is necessaryto go back to the original KCNA documents in order to understandthe contexts in which these terms were used A close reading of theKCNA articles that included the five key indicators or lsquoattack wordsrsquoreveals several interesting patterns

First the North Korean regime tends to emphasize its history ofmilitary struggle against foreign enemies before it launches armed prov-ocations It is then understandable that key terms such as lsquoyearsrsquo andlsquoJapanesersquo often appear in KCNA articles immediately before a militarystrike Prior to the second naval clash of Yonpyong on 29 June 2002for instance the North Korean government published a series of arti-cles in which it boasted of the lsquoyearsrsquo of North Korearsquos victoriousstruggles against foreign enemies including lsquoJapanesersquo colonialism(1910ndash45) It is possible that Pyongyang invoked the military legacy of

Detecting patterns in North Korean military provocations 15

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

its supreme leader Kim Il-sung in anticipation of an impending con-flict either to boost the morale of its people or to show the outsideworld that it was determined not to back down

Second the most perplexing term in Figure 4 is lsquoJunersquo because itcan be indicative of both threat and peace Specifically the word lsquoJunersquois an indicator of peace when associated with lsquoreunificationrsquo but it canalso be a sign of a military threat when it does not appear in conjunc-tion with lsquoreunificationrsquo A careful reading of the KCNA articles thatinclude the term lsquoJunersquo suggests an explanation for this strange phe-nomenon It turns out that the word lsquoJunersquo can refer to either theKorean War (which broke out on 25 June 1950) or the historic summitbetween Kim Dae-jung and Kim Il-sung on 15 June 2000 (which waspraised as a major cornerstone for future lsquoreunificationrsquo in the officialNorth Korean press) As a result when the word lsquoJunersquo appears inclose association with lsquoreunificationrsquo it refers to the 2000 summit (orlsquothe June 15 Summitrsquo as it is called in North Korea) thus indicatingthe peaceful intentions of Pyongyang By contrast when the wordlsquoJunersquo appears alone without connection to lsquoreunificationrsquo it refers tothe Korean War thus conveying a more hostile mood

Third it is also interesting to note that the North Korean govern-ment tends to publish many lsquosignedrsquo commentaries from RodongSinmun in the KCNA immediately before it launches military provoca-tions In these lsquosignedrsquo commentaries of Rodong Sinmun Pyongyangusually criticizes foreign countries (especially the United States Japanand South Korea) in very harsh terms on various issues Accordinglythe third term lsquosignedrsquo seems to correspond to the fact that RodongSinmun the official mouthpiece of the North Korean regime barksvery loudly before it actually bites its intended target A sudden in-crease in the number of lsquosignedrsquo commentaries of Rodong Sinmun inKCNA articles appears to be a reliable sign that the North Koreangovernment may be turning to an attack mode

Finally the last indicator of an imminent threat lsquoassemblyrsquo is theeasiest term to identify but the hardest one to interpret As shown inFigure 5 the term lsquoassemblyrsquo is primarily used in reference to the SPABecause the SPA is the highest North Korean authority with regard toits relations with foreign countries it is tempting to interpret a suddenincrease in references to the SPA prior to military provocations as indi-cating that Pyongyang is attempting to openly signal its hostile

16 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

intentions to the outside world In other words belligerent messagesemanating from the SPA (and reported in KCNA articles) may revealthe determination of the North Korean regime

Although plausible the problem with such an interpretation is thata close reading of several KCNA articles using the term lsquoassemblyrsquoshows that messages from the SPA are often not dire at all For exam-ple consider the following article published on 17 November 2010 justdays before the North Korean shelling at Yonpyong Island lsquoKim YongNam President of the Presidium of the DPRK SPA sent a message ofgreetings to Qaboos Bin Said Sultan of Oman on Wednesday on theoccasion of its national holidayrsquo (KCNA 17 November 2010) As thisexample illustrates a typical KCNA article with the word lsquoassemblyrsquodescribes rather routine business of the SPA such as how it sent a mes-sage of congratulations to a foreign country how it greeted visitingdignitaries from foreign countries and so on Clearly such mundanemessages cannot be a meaningful sign of an imminent North Koreanmilitary provocation While the publication of these articles (containingthe term lsquoassemblyrsquo) might possibly correspond to an uptick inPyongyangrsquos diplomatic efforts to strengthen its ties with foreign coun-tries and increase international support prior to a military attack weneed further analyses to substantiate such a hypothesis At the sametime it also seems clear from the test that the term lsquoassemblyrsquo does not

Figure 5 Social network analysis for key words in KCNA articles

Detecting patterns in North Korean military provocations 17

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

appear randomly and is somehow linked to the timing of NorthKorean military provocations Further research is necessary to make aproper interpretation of this seemingly irrelevant yet apparently signifi-cant indicator of North Korean military threats

31 Robustness checks

Before we conclude it is necessary to address five issues regarding therobustness of our findings (i) the rare-event issue (that is due to therare nature of a North Korean military attack our sampling ratio of710 should be justified) (ii) an out-of-sample alternative (that is in-stead of building our model by using 70 of data from the five NorthKorean military strikes and then applying it to the remaining 30 ofthe data it may be better to develop a model from the first threestrikes and then apply it to the remaining two North Korean provoca-tions) and (iii) a North Korean leadership change effect (that is in-stead of elaborating a single model covering 1997 to 2013 it may makesense to develop two separate models [one for the Kim Jong-il periodand the other for the Kim Jong-un era] in order to check whether thereis a leadership change effect) (iv) a South Korean leadership changeeffect (that is it may be better to elaborate two different models [onefor the lsquoconservativersquo period during Lee Myung-bak and Park Geun-hye and the other for the lsquoprogressiversquo era during Kim Dae-jung andRoh Moo-hyun] in order to investigate whether North Korea re-sponded differently to leadership changes in South Korea and (v) ano-Cheonan alternative (that is it is worth elaborating a model whileexcluding the Cheonan naval ship case because unlike the remainingfour attacks Pyongyang has denied its involvement in sinking theCheonan In this section these five issues are addressed to check therobustness of our model

First there is a discrepancy in the ratio of threat days vs non-threatdays between the data used in our analysis and the actual frequencyAs explained earlier we have defined the threat articles as those pub-lished one week prior to each actual crisis whereas the non-threatitems are defined as those chosen randomly for a 10-day period at leasttwo months before or after actual provocations For all five crises theratio of our data is then 3550 (in days) or 710 In reality however thenumber of peace days (ie 18 years minus 35 days) is much larger than

18 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

the number of crisis days (35 days) When we count all of the peacedays that are not included in our data (ie when we take all TrueNegatives into our data) the North Korean provocations become ex-tremely rare events As a result a possible criticism is that our ap-proach does not represent the true data generating process in a sam-pling procedure While military provocations occur rarely in realityour sampling procedure exaggerates their likelihood by significantly re-ducing the number of peace days in the data Put more technically ourmodel reduces the number of False Positives while increasing the num-ber of False Negatives

Despite the criticism our sampling strategy can be justified for tworeasons First the rare-event problem is not new in political scienceFor example King and Zeng (2001ab) addressed the same problem instatistical analysis (eg logit analysis with rare events) A commonlysuggested solution is the lsquochoice-based or endogenous stratified sam-plingrsquo in econometrics and the lsquocase-control designrsquo in epidemiology(Breslow 1996) The idea is to lsquoselect within categories of Yrsquo such thatall crisis days are sampled while we use a small random fraction ofnon-crisis days In this respect our sampling strategy is consistent withsuch an approach in that we have chosen all crisis days (35) and a ran-dom fraction of peace days (50) Not surprisingly such an approach iscommonly used in supervised machine-learning as well The rare-eventissue also known as a class imbalance problem is an ongoing issue insupervised machine-learning because the size of one class is often muchlarger than that of the other class (eg premature births violent civilconflicts fraudulent credit card transactions etc) In supervisedmachine-learning two solutions have been suggested a data samplingtechnique and an algorithmic modification technique Whereas datasampling is to under-sample the majority class while over-sampling theminority class a modification technique is to combine both over- andunder-sampling methods at the same time In this article we haveadopted a data sampling technique in order to address the class imbal-ance between threat days (a minority class) and non-threat days (a ma-jority class) in North Korean military provocations

Second although reducing the number of non-crisis days in ourdata (under-sampling) while using all the crisis days (over-sampling) isconsistent with both statistical and machine-learning literature therestill remains a question How far should we reduce the number of

Detecting patterns in North Korean military provocations 19

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

peace days Will a ratio of 12 13 or an even more skewed ratio gen-erate a better performance than our model using a 710 ratio To an-swer this question we pushed the ratio until we saw a clear sign of themodel becoming too skewed towards the majority class (peace days)In our case it occurred when the ratio between the two classes ex-ceeded 13 On this subject literature on machine-learning clearlyshows that the ratio of crisis and non-crisis should not be significantlyunbalanced According to Ertekin (2013) lsquo[i]n problems where theprevalence of classes is imbalanced it is necessary to prevent the resul-tant model from being skewed towards the majority (negative) classand to ensure that the model is capable of reflecting the true nature ofthe minority (positive) classrsquo Otherwise the generated model can sufferfrom an over-fitting problem For instance in the face of an extremelyskewed dataset (eg the real ratio of our data would be 35 crisis daysvs 6535 peace days) an automated machine-learning process naturallybecomes lsquogreedyrsquo that is in order to increase its overall accuracy it in-creasingly focuses on the larger category (6535 days) while virtually ig-noring the smaller category (35 days) Generating the model in thisway would be problematic however because our object is to focus onthe minority category of crisis days which is less than 05 of all daysfrom 1997 to 2013 After all we are trying to classify upcoming NorthKorean attacks not peaceful days

With this goal in mind we ran robustness checks to see if alternativemodels yield a better explanatory power when we increased the numberof peace days vis-a-vis threat days Because the maximum limit of anunbalanced ratio for automated machine-learning is 13 we attemptedboth 12 and 13 in our tests As Table 2 shows it is clear that the orig-inal model outperforms both alternatives Compared to the 12 ratiothe original model has more explanatory power in every way (iehigher overall accuracy higher threat accuracy higher non-threat accu-racy) By contrast the 13 ratio model has a slightly lower overall accu-racy marginally higher non-threat accuracy but devastatingly lowerthreat accuracy (175) than our original model because the increasingimbalance (from 710 to 13) turned the automated machine-learningprocess to a lsquogreedyrsquo mode As a result it is our conclusion that theoriginal model uses the golden ratio with the most accurate categoriz-ing capacity

20 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Second it is also necessary to check whether an out-of-sample ap-proach is a better alternative to our original model In our analysis wedeveloped a model by using 70 of data from the five North Koreanattacks and then applied it to the remaining 30 of data from Onemay suspect however that a better approach is to elaborate a modelfrom the first three North Korean strikes (ie during the Kim Jong-ilperiod) and then apply it to the remaining two attacks (ie during theKim Jong-un era) in order to see whether it yields more explanatorypower As Table 2 shows the opposite turns out to be the case In factthe out-of-sampling model loses explanatory power in every possibleway (ie lower overall accuracy lower threat accuracy and lower non-threat accuracy) As a result our original model outperforms the out-of-sampling alternative

Third it is necessary to check whether or not there is a leadershipchange effect in North Korea Has the change of power from KimJong-il to Kim Jong-un produced such significant differences in theirleadership style that it may be worth developing two separate models(one for the Kim Jong-il period the other for the Kim Jong-un era)instead of a single model covering the entire period as we did As

Table 2 Comparison with alternative models

ModelsOverall accuracy (N)

Threat accuracy(Sensitivity)

Non-threataccuracy (Specificity)

Rare event I 724 328 823

12 Ratio (368508) (66201) (302367)

Rare event II 788 175 99

13 Ratio (656832) (36206) (620626)

Out of sampling 541 145 821

KJI KJU (6471195) (72495) (575700)

NK leadership KJI 685 (98143) 459 (2861) 793 (6582)

KJI vs KJU KJU 595 (195328) 612 (93152) 58 (102176)

SK leadership Con 614 (221360) 363 (58160) 815 (163200)

Con vs Prog Prog 681 (98144) 493 (3571) 863 (6373)

No cheonan 669 514 787

Case (273408) 92178 181230

Our Model 823 503 961

(401487) (74147) (327340)

Detecting patterns in North Korean military provocations 21

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Table 2 shows the double platform alternative yields a worse outcomeoverall In fact its only advantage is that the Kim-Jong-un-model hasslightly higher threat accuracy (612) than our original model(503) In every other aspect however the Kim-Jong-un-model per-forms much worse Moreover the Kim-Jong-il-model has a much loweraccuracy in all respects than our original model As a result it is clearthat the double platform is a worse alternative to our single platformThe recent leadership change in Pyongyang has shown little effects asfar as its military provocations are concerned

Fourth it is important to investigate whether a leadership change inSouth Korea has any impact on North Korean military provocationsHas the oscillation of power between the lsquoprogressiversquo administrations(Kim Dae-jung and Roh Moo-hyun) and the lsquoconservativersquo regimes(Lee Myung-bak and Park Geun-hye) invoked different responses fromPyongyang If so we should expect a double platform (one model forthe conservative period vs the other model for the progressive era inSouth Korea) to outperform the original single platform As Table 2shows however the two separate models in the double-platform alter-native perform much worse than the original single platform in all cat-egories an overall accuracy threat accuracy and non-threat accuracyClearly the original model is a better choice It is shown above thatthe leadership change in Pyongyang has not produced significant policyshifts in its military provocations Likewise leadership changes inSouth Korea do not seem to have invoked major policy shifts inPyongyang as far as its military provocations are concerned

Finally it is worth examining an alternative model that excludes thesinking of the Cheonan naval ship case Unlike the remaining four at-tacks in our dataset the North Korean government has consistentlydenied its involvement in the Cheonan incident If Pyongyang had notreally sunk the Cheonan as it claimed it means that our original modelis performing below its full potential because it is built upon some er-roneous cases (ie the Cheonan incident) In that case we should ex-pect the no-Cheonan alternative to outperform our original model AsTable 2 shows however when we exclude all the data related to theCheonan case the resulting model loses much explanatory power com-pared to the original model Only in one category (a threat accuracy)the no-Cheonan-case alternative outperforms our original model buteven in that category the difference is ignorable (503 vs 514) In

22 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

all other categories the original model shows much stronger perfor-mances 823 vs 669 in an overall accuracy and 961 vs 787 ina non-threat accuracy Although Pyongyang has denied its involvementin the Cheonan incident the huge performance gap between the origi-nal model and the no-Cheonan alternative suggests otherwise

A different way of comparing performances of alternative models isshown in Figure 6 where the Receiver Operating Characteristic (ROC)is presented A ROC space is defined by a False Positive Rate (1 ndashSpecificity) in the x-axis and a True Positive Rate (Sensitivity) in the y-axis In a ROC space a theoretically perfect model (ie a combinationof 100 True Positive Rate with 0 False Positive Rate) appears in theupper left corner with the coordinates (0 1) In comparison a purelyrandom model such as a coin toss is found along a 45-degree diagonalline As a result good models (ie those performing better than ran-dom guesses) are found above the 45 degree line (especially close to they-axis) whereas bad models are located below it In Figure 6 the

Figure 6 Receiver operating characteristic (ROC) plot

Detecting patterns in North Korean military provocations 23

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

coordinates of our model (004 05) means that it has a False PositiveRate of 4 while its True Positive Rate is 50 By contrast the coordi-nates of alternative models in the ROC space are as follows (i) theRare Event I model with 12 ratio is (018 033) (ii) the Rare Event IImodel with 13 ratio is (001 018) (iii) the Out of Sampling (KJI KJU) model is (018 015) (iv) the Leadership Change (KJI-only)model is (021 046) (v) the Leadership Change (KJU-only) model is(042 061) (vi) the model for Conservative South Korean leadershipis (018 036) (vii) the model for Conservative South Korean leader-ship is (014 049) and (viii) the model that excludes the Cheonan caseis (021 051) As one can see all eight alternatives yield their coordi-nates either close to or lower than the diagonal line in the ROC spacewhereas our model lies above the diagonal line and close to the y-axisin the ROC space As a result we can conclude that the original modelperforms better than its potential rivals

4 Conclusion

In this article we studied articles published by the KCNA the officialnews outlet of North Korea in order to analyze patterns of its conven-tional military provocations To this end we have adopted a newmethod of automated text classification through supervised machinelearning Our model investigated the frequency of all terms appearingin KCNA articles immediately prior to five North Korean military at-tacks between 1997 and 2013 The frequency of these terms was thencompared with the frequency of terms appearing in KCNA articlespublished during peacetime without military provocations The com-parison brought to light a number of key terms ndash lsquoattack wordsrsquo so tospeak ndash whose appearances spiked in the KCNA prior to NorthKorean attacks Based on these terms our model correctly identifieseight of 10 articles as signs of imminent attacks or as peacetime newspieces

Specifically our model found five pattern-detecting terms of NorthKorean military threats lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquo lsquoJunersquo andlsquoJapanesersquo For a proper analysis of their meaning we went back to thearticles in which these five attack words appeared in order to investi-gate the contexts in which they were used by the North Korean gov-ernment Our investigation shows that in the lead-up to an attack

24 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Pyongyang displayed a strong tendency to emphasize the legacy of itsmilitary struggle against lsquoJapanesersquo colonialism and its fight against theUS imperialism during the Korean War (which began in lsquoJunersquo 1950)perhaps in an effort to heighten a domestic patriotic fervor at a timeof impending crisis In addition immediately before Pyongyang em-barks on hostile provocations the KCNA tends to increase its reprint-ing of lsquosignedrsquo commentaries from Rodong Sinmun which typically crit-icizes the United States South Korea and Japan in harsh termsprobably reflecting its deteriorating relations with the outside world

It seems that our methodology is promising to studies of interna-tional conflicts of various sorts First as shown in this article amachine-learning technology is useful in terms of detecting patternsthat distinguish threats of Pyongyang from its lsquonoisesrsquo or lsquobluffingrsquoSecond for other authoritarian regimes where there is a tight govern-ment controls over the mass media our approach can be used to detectpatterns symptoms or clues to escalating crisis Finally even for dem-ocratic countries with a free media our approach can be used on vari-ous occasions For instance a machine-learning technology can be uti-lized to analyze certain aspects of domestic politics such as signalingor communication between policymakers and the public To cite a con-crete example we can use a machine-learning technique to examinehow an aggressive foreign policy like lsquosabre rattlingrsquo affects public per-ceptions of fear or crisis in democracy Also a machine-learning ap-proach can be used to analyze external interactions of a democratic re-gime with other countries For instance we can test if there are anysignificant correlations between presidential elections in the US and thelevel of North Korean threats or between North Korean nuclear testsand varying public reactions in South Korea

As a final note it is not our contention that the model developed inthis article based on an automated machine-learning technique has de-tected intentional signals which Pyongyang sends to the outside worldimmediately prior to its military attack Instead the key attack wordswe have identified should be seen as signs or patterns that the NorthKorean regime unwittingly displays when it is inching toward a militaryoption For two reasons we doubt an intentional or signaling natureof the attack words in our model First if Pyongyang has indeed usedthe five attack words as a deliberate signal to the outside world that itis gearing up for military provocations why would it send the signal in

Detecting patterns in North Korean military provocations 25

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

such an arcane way that can be detected only through a complicatedautomated text classification process If the North Korean regime in-tends to send a signal to the outside world there is a better clearerand more visible option such as an Official Announcement by theMinistry of Foreign Affairs which Pyongyang has occasionally pub-lished in the pages of the KCNA on important issues Second if it hadbeen sending intentional signals of an impending military strike whywould the North Korean regime deny such an attack afterwards If re-peated an ex post denial would only reduce the credibility of an exante signal creating an image of North Korea as a casual bluffer Fortheir arcane nature and occasional ex post denial the five attack wordsin our model should not be treated as a premeditated signal fromPyongyang Instead they should be understood as inadvertent signs orpatterns unconsciously displayed by the North Korean governmentprior to a military attack In fact it is the unplanned nature of theseattack words that provides even more valuable information to the out-side world because it eliminates the possibility of feigned or false sig-nals from North Korea Like a pitcher who unknowingly flinches be-fore he throws a fast ball Pyongyang may unwittingly display certainpatterns before it launches a military strike

Acknowledgements

The publication of this article is supported by the National ResearchFoundation of Korea Grant funded by the Korean Government (NRF-2014S1A5A2A03065042 and NRF-2013S1A3A2055081) Previous ver-sions of this article were presented at the 2014 International StudiesAssociation Annual Convention (Toronto Canada) For questions re-garding the article please contact H Joo

References

Breslow NE (1196) lsquoStatistics in epidemiology the case-control studyrsquoJournal of American Statistical Association 91 433 14ndash28

Cha V (2010) The Sinking of the Cheonan Center for Strategic ampInternational Studies (2010 April 22) httpcsisorgpublicationsinking-cheonan (24 July 2014 date last accessed)

26 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Choe S-H (2009) Korean Navies Skirmish in DisputedWaters New YorkTimes (2009 November 10) httpwwwnytimescom20091111worldasia11koreahtml_rfrac140 (23 June 2014 date last accessed)

Ertekin S (2013) lsquoAdaptive oversampling for imbalanced data classificationrsquoin G Erol amp L Ricard (ed) Information Sciences and Systems pp261ndash269 New York Springer

Global Security (2002) The naval clash on the yellow sea on 29 June 2002 be-tween South and North Korea the situation and ROKrsquos position (2002July 1) Global Security httpwwwglobalsecurityorgwmdlibrarynewsdprk2002dprk-020701-1htm (22 July 2014 date last accessed)

Hong S (2012) lsquoNorth Korearsquos capability to conduct provocations and ROK-US capability to counter themrsquo Korea Association of Defense IndustryStudies 19 2 135ndash136

Hopkins D and King G (2010) lsquoExtracting systemic social science meaningfrom textrsquo American Journal of Political Science 54 1 229ndash47

Jung S-C (2013) Internal Instability and External Provocation Seoul KINU

Jung S-Y (2008) lsquoA study on the origin of the pueblo incident on 1968rsquoKukjejongchiyonrsquogu 11 2 179ndash207

Jurafsky D and Martin J (2009) Speech and Natural Language ProcessingAn Introduction to Natural Language Processing Computational Linguisticsand Speech Recognition NJ Prentice Hall

Kang C-G (2013) lsquoA laboratory for recursive partyrsquo Hangukcontentshakhoe11 4 23ndash32

King G and Zeng L (2001a) lsquoLogistic regression in rare events datarsquoPolitical Analysis 9 2 137ndash163

King G and Zeng L (2001b) lsquoExplaining rare events in InternationalRelationsrsquo International Organization 55 3 693ndash715

Ko M-K (2015) lsquoNorth Korean military adventurism in the late 1960s andthe changes of party-military relationsrsquo Hyondaibukhanyongu 18 3 7ndash58

Lee W-K (2014) lsquoA case study on the provocations by NKrsquo Kunsaji 91 663ndash110

Litwak R (2007) Regime Change Washington DC Johns HopkinsUniversity Press

Macfie N (2013) The Battles of the Korean West Sea (2010 November 29)Reutershttpwwwreuterscomarticle20101129us-korea-north-clashes-idUSTRE6AS1AL20101129 (5 August 2014 date last accessed)

Mearsheimer J J (2001) The Tragedy of Great Power Politics New YorkWW Norton amp Company

Moore M and Hutchison P (2010) Yeonpyeong Island A History(2010November 23) The Telegraph httpwwwtelegraphcouknewsworldnews

Detecting patterns in North Korean military provocations 27

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

asiasouthkorea8155486Yeonpyeong-Island-A-historyhtml (7 August 2014date last accessed)

Oh I-W (2011) lsquoSouth Korearsquos countermeasures against North KorearsquosArmed Provocation and Offensive dialogue proposalrsquo Trsquoongiljyollyak 111 227ndash266

Rich T (2012a) lsquoLike father like son Correlates of leaderhip in NorthKorearsquos english language newsrsquo Korea Observer 43 4 649ndash674

Rich T (2012b) lsquoDeciphering North Korearsquos nuclear rhetoric an automatedcontent analysis of KCNA newsrsquo Asian Affairs An American Review 3973ndash89

Sigal L (1998) Disarming Strangers Nuclear Diplomacy with North KoreaPrinceton Princeton University Press

Sohn J-Y (2002) South North Korea clash at sea CNN (2002 June 29)httpeditioncnncom2002WORLDasiapcfeast0629koreawarships (21July 2014 date last accessed)

Sudworth J (2010) How South Korean Ship was Sunk BBC News (2010May 20) httpwwwbbccouknews10130909 (4 August 2014 date lastaccessed)

28 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

  • lcw016-FN1
  • lcw016-FN2
  • lcw016-FN3
  • lcw016-FN4
Page 15: Detecting patterns in North Korean military provocations ...idse.or.kr/file/Whang_1.pdf · that North Korea may have because of increasing power asymmetry since the collapse of the

model classified 327 cases (the lower right cell) as non-threat and thosearticles were in fact published during non-threat weeks In additionthe model classified 74 articles (the upper left cell) as threats and thosearticles were actually published during threat weeks When we applyour model to the test dataset its overall accuracy turns out 823roughly the same level we obtained with the training dataset Althoughour model is deduced from the training dataset it does not lose cate-gorizing capacity when applied to the test dataset Specifically of 487articles in the test dataset our model correctly classified 401 articles(823) as either threats or non-threats The model is very effective inidentifying harmless rhetoric from Pyongyang that is messages not re-sulting in actual military attacks When applied to a total of 340 non-threat articles in the test dataset our model successfully classified 327items as noise with only 13 misses (ie 967 accuracy) As a result itis very effective at filtering noise from Pyongyang By contrast thepattern-detecting accuracy of our model drops somewhat with respectto threats Of 147 threat articles in the test dataset it correctly identi-fies 74 items as threats while misrecognizing 73 threats as peaceful arti-cles (ie 50 accuracy)

3 Discussion

What does the model tell us in plain terms In particular what are themeanings of the key indicators for a North Korean military threat (ieyears signed assembly June and Japanese) To answer it is necessaryto go back to the original KCNA documents in order to understandthe contexts in which these terms were used A close reading of theKCNA articles that included the five key indicators or lsquoattack wordsrsquoreveals several interesting patterns

First the North Korean regime tends to emphasize its history ofmilitary struggle against foreign enemies before it launches armed prov-ocations It is then understandable that key terms such as lsquoyearsrsquo andlsquoJapanesersquo often appear in KCNA articles immediately before a militarystrike Prior to the second naval clash of Yonpyong on 29 June 2002for instance the North Korean government published a series of arti-cles in which it boasted of the lsquoyearsrsquo of North Korearsquos victoriousstruggles against foreign enemies including lsquoJapanesersquo colonialism(1910ndash45) It is possible that Pyongyang invoked the military legacy of

Detecting patterns in North Korean military provocations 15

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

its supreme leader Kim Il-sung in anticipation of an impending con-flict either to boost the morale of its people or to show the outsideworld that it was determined not to back down

Second the most perplexing term in Figure 4 is lsquoJunersquo because itcan be indicative of both threat and peace Specifically the word lsquoJunersquois an indicator of peace when associated with lsquoreunificationrsquo but it canalso be a sign of a military threat when it does not appear in conjunc-tion with lsquoreunificationrsquo A careful reading of the KCNA articles thatinclude the term lsquoJunersquo suggests an explanation for this strange phe-nomenon It turns out that the word lsquoJunersquo can refer to either theKorean War (which broke out on 25 June 1950) or the historic summitbetween Kim Dae-jung and Kim Il-sung on 15 June 2000 (which waspraised as a major cornerstone for future lsquoreunificationrsquo in the officialNorth Korean press) As a result when the word lsquoJunersquo appears inclose association with lsquoreunificationrsquo it refers to the 2000 summit (orlsquothe June 15 Summitrsquo as it is called in North Korea) thus indicatingthe peaceful intentions of Pyongyang By contrast when the wordlsquoJunersquo appears alone without connection to lsquoreunificationrsquo it refers tothe Korean War thus conveying a more hostile mood

Third it is also interesting to note that the North Korean govern-ment tends to publish many lsquosignedrsquo commentaries from RodongSinmun in the KCNA immediately before it launches military provoca-tions In these lsquosignedrsquo commentaries of Rodong Sinmun Pyongyangusually criticizes foreign countries (especially the United States Japanand South Korea) in very harsh terms on various issues Accordinglythe third term lsquosignedrsquo seems to correspond to the fact that RodongSinmun the official mouthpiece of the North Korean regime barksvery loudly before it actually bites its intended target A sudden in-crease in the number of lsquosignedrsquo commentaries of Rodong Sinmun inKCNA articles appears to be a reliable sign that the North Koreangovernment may be turning to an attack mode

Finally the last indicator of an imminent threat lsquoassemblyrsquo is theeasiest term to identify but the hardest one to interpret As shown inFigure 5 the term lsquoassemblyrsquo is primarily used in reference to the SPABecause the SPA is the highest North Korean authority with regard toits relations with foreign countries it is tempting to interpret a suddenincrease in references to the SPA prior to military provocations as indi-cating that Pyongyang is attempting to openly signal its hostile

16 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

intentions to the outside world In other words belligerent messagesemanating from the SPA (and reported in KCNA articles) may revealthe determination of the North Korean regime

Although plausible the problem with such an interpretation is thata close reading of several KCNA articles using the term lsquoassemblyrsquoshows that messages from the SPA are often not dire at all For exam-ple consider the following article published on 17 November 2010 justdays before the North Korean shelling at Yonpyong Island lsquoKim YongNam President of the Presidium of the DPRK SPA sent a message ofgreetings to Qaboos Bin Said Sultan of Oman on Wednesday on theoccasion of its national holidayrsquo (KCNA 17 November 2010) As thisexample illustrates a typical KCNA article with the word lsquoassemblyrsquodescribes rather routine business of the SPA such as how it sent a mes-sage of congratulations to a foreign country how it greeted visitingdignitaries from foreign countries and so on Clearly such mundanemessages cannot be a meaningful sign of an imminent North Koreanmilitary provocation While the publication of these articles (containingthe term lsquoassemblyrsquo) might possibly correspond to an uptick inPyongyangrsquos diplomatic efforts to strengthen its ties with foreign coun-tries and increase international support prior to a military attack weneed further analyses to substantiate such a hypothesis At the sametime it also seems clear from the test that the term lsquoassemblyrsquo does not

Figure 5 Social network analysis for key words in KCNA articles

Detecting patterns in North Korean military provocations 17

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

appear randomly and is somehow linked to the timing of NorthKorean military provocations Further research is necessary to make aproper interpretation of this seemingly irrelevant yet apparently signifi-cant indicator of North Korean military threats

31 Robustness checks

Before we conclude it is necessary to address five issues regarding therobustness of our findings (i) the rare-event issue (that is due to therare nature of a North Korean military attack our sampling ratio of710 should be justified) (ii) an out-of-sample alternative (that is in-stead of building our model by using 70 of data from the five NorthKorean military strikes and then applying it to the remaining 30 ofthe data it may be better to develop a model from the first threestrikes and then apply it to the remaining two North Korean provoca-tions) and (iii) a North Korean leadership change effect (that is in-stead of elaborating a single model covering 1997 to 2013 it may makesense to develop two separate models [one for the Kim Jong-il periodand the other for the Kim Jong-un era] in order to check whether thereis a leadership change effect) (iv) a South Korean leadership changeeffect (that is it may be better to elaborate two different models [onefor the lsquoconservativersquo period during Lee Myung-bak and Park Geun-hye and the other for the lsquoprogressiversquo era during Kim Dae-jung andRoh Moo-hyun] in order to investigate whether North Korea re-sponded differently to leadership changes in South Korea and (v) ano-Cheonan alternative (that is it is worth elaborating a model whileexcluding the Cheonan naval ship case because unlike the remainingfour attacks Pyongyang has denied its involvement in sinking theCheonan In this section these five issues are addressed to check therobustness of our model

First there is a discrepancy in the ratio of threat days vs non-threatdays between the data used in our analysis and the actual frequencyAs explained earlier we have defined the threat articles as those pub-lished one week prior to each actual crisis whereas the non-threatitems are defined as those chosen randomly for a 10-day period at leasttwo months before or after actual provocations For all five crises theratio of our data is then 3550 (in days) or 710 In reality however thenumber of peace days (ie 18 years minus 35 days) is much larger than

18 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

the number of crisis days (35 days) When we count all of the peacedays that are not included in our data (ie when we take all TrueNegatives into our data) the North Korean provocations become ex-tremely rare events As a result a possible criticism is that our ap-proach does not represent the true data generating process in a sam-pling procedure While military provocations occur rarely in realityour sampling procedure exaggerates their likelihood by significantly re-ducing the number of peace days in the data Put more technically ourmodel reduces the number of False Positives while increasing the num-ber of False Negatives

Despite the criticism our sampling strategy can be justified for tworeasons First the rare-event problem is not new in political scienceFor example King and Zeng (2001ab) addressed the same problem instatistical analysis (eg logit analysis with rare events) A commonlysuggested solution is the lsquochoice-based or endogenous stratified sam-plingrsquo in econometrics and the lsquocase-control designrsquo in epidemiology(Breslow 1996) The idea is to lsquoselect within categories of Yrsquo such thatall crisis days are sampled while we use a small random fraction ofnon-crisis days In this respect our sampling strategy is consistent withsuch an approach in that we have chosen all crisis days (35) and a ran-dom fraction of peace days (50) Not surprisingly such an approach iscommonly used in supervised machine-learning as well The rare-eventissue also known as a class imbalance problem is an ongoing issue insupervised machine-learning because the size of one class is often muchlarger than that of the other class (eg premature births violent civilconflicts fraudulent credit card transactions etc) In supervisedmachine-learning two solutions have been suggested a data samplingtechnique and an algorithmic modification technique Whereas datasampling is to under-sample the majority class while over-sampling theminority class a modification technique is to combine both over- andunder-sampling methods at the same time In this article we haveadopted a data sampling technique in order to address the class imbal-ance between threat days (a minority class) and non-threat days (a ma-jority class) in North Korean military provocations

Second although reducing the number of non-crisis days in ourdata (under-sampling) while using all the crisis days (over-sampling) isconsistent with both statistical and machine-learning literature therestill remains a question How far should we reduce the number of

Detecting patterns in North Korean military provocations 19

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

peace days Will a ratio of 12 13 or an even more skewed ratio gen-erate a better performance than our model using a 710 ratio To an-swer this question we pushed the ratio until we saw a clear sign of themodel becoming too skewed towards the majority class (peace days)In our case it occurred when the ratio between the two classes ex-ceeded 13 On this subject literature on machine-learning clearlyshows that the ratio of crisis and non-crisis should not be significantlyunbalanced According to Ertekin (2013) lsquo[i]n problems where theprevalence of classes is imbalanced it is necessary to prevent the resul-tant model from being skewed towards the majority (negative) classand to ensure that the model is capable of reflecting the true nature ofthe minority (positive) classrsquo Otherwise the generated model can sufferfrom an over-fitting problem For instance in the face of an extremelyskewed dataset (eg the real ratio of our data would be 35 crisis daysvs 6535 peace days) an automated machine-learning process naturallybecomes lsquogreedyrsquo that is in order to increase its overall accuracy it in-creasingly focuses on the larger category (6535 days) while virtually ig-noring the smaller category (35 days) Generating the model in thisway would be problematic however because our object is to focus onthe minority category of crisis days which is less than 05 of all daysfrom 1997 to 2013 After all we are trying to classify upcoming NorthKorean attacks not peaceful days

With this goal in mind we ran robustness checks to see if alternativemodels yield a better explanatory power when we increased the numberof peace days vis-a-vis threat days Because the maximum limit of anunbalanced ratio for automated machine-learning is 13 we attemptedboth 12 and 13 in our tests As Table 2 shows it is clear that the orig-inal model outperforms both alternatives Compared to the 12 ratiothe original model has more explanatory power in every way (iehigher overall accuracy higher threat accuracy higher non-threat accu-racy) By contrast the 13 ratio model has a slightly lower overall accu-racy marginally higher non-threat accuracy but devastatingly lowerthreat accuracy (175) than our original model because the increasingimbalance (from 710 to 13) turned the automated machine-learningprocess to a lsquogreedyrsquo mode As a result it is our conclusion that theoriginal model uses the golden ratio with the most accurate categoriz-ing capacity

20 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Second it is also necessary to check whether an out-of-sample ap-proach is a better alternative to our original model In our analysis wedeveloped a model by using 70 of data from the five North Koreanattacks and then applied it to the remaining 30 of data from Onemay suspect however that a better approach is to elaborate a modelfrom the first three North Korean strikes (ie during the Kim Jong-ilperiod) and then apply it to the remaining two attacks (ie during theKim Jong-un era) in order to see whether it yields more explanatorypower As Table 2 shows the opposite turns out to be the case In factthe out-of-sampling model loses explanatory power in every possibleway (ie lower overall accuracy lower threat accuracy and lower non-threat accuracy) As a result our original model outperforms the out-of-sampling alternative

Third it is necessary to check whether or not there is a leadershipchange effect in North Korea Has the change of power from KimJong-il to Kim Jong-un produced such significant differences in theirleadership style that it may be worth developing two separate models(one for the Kim Jong-il period the other for the Kim Jong-un era)instead of a single model covering the entire period as we did As

Table 2 Comparison with alternative models

ModelsOverall accuracy (N)

Threat accuracy(Sensitivity)

Non-threataccuracy (Specificity)

Rare event I 724 328 823

12 Ratio (368508) (66201) (302367)

Rare event II 788 175 99

13 Ratio (656832) (36206) (620626)

Out of sampling 541 145 821

KJI KJU (6471195) (72495) (575700)

NK leadership KJI 685 (98143) 459 (2861) 793 (6582)

KJI vs KJU KJU 595 (195328) 612 (93152) 58 (102176)

SK leadership Con 614 (221360) 363 (58160) 815 (163200)

Con vs Prog Prog 681 (98144) 493 (3571) 863 (6373)

No cheonan 669 514 787

Case (273408) 92178 181230

Our Model 823 503 961

(401487) (74147) (327340)

Detecting patterns in North Korean military provocations 21

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Table 2 shows the double platform alternative yields a worse outcomeoverall In fact its only advantage is that the Kim-Jong-un-model hasslightly higher threat accuracy (612) than our original model(503) In every other aspect however the Kim-Jong-un-model per-forms much worse Moreover the Kim-Jong-il-model has a much loweraccuracy in all respects than our original model As a result it is clearthat the double platform is a worse alternative to our single platformThe recent leadership change in Pyongyang has shown little effects asfar as its military provocations are concerned

Fourth it is important to investigate whether a leadership change inSouth Korea has any impact on North Korean military provocationsHas the oscillation of power between the lsquoprogressiversquo administrations(Kim Dae-jung and Roh Moo-hyun) and the lsquoconservativersquo regimes(Lee Myung-bak and Park Geun-hye) invoked different responses fromPyongyang If so we should expect a double platform (one model forthe conservative period vs the other model for the progressive era inSouth Korea) to outperform the original single platform As Table 2shows however the two separate models in the double-platform alter-native perform much worse than the original single platform in all cat-egories an overall accuracy threat accuracy and non-threat accuracyClearly the original model is a better choice It is shown above thatthe leadership change in Pyongyang has not produced significant policyshifts in its military provocations Likewise leadership changes inSouth Korea do not seem to have invoked major policy shifts inPyongyang as far as its military provocations are concerned

Finally it is worth examining an alternative model that excludes thesinking of the Cheonan naval ship case Unlike the remaining four at-tacks in our dataset the North Korean government has consistentlydenied its involvement in the Cheonan incident If Pyongyang had notreally sunk the Cheonan as it claimed it means that our original modelis performing below its full potential because it is built upon some er-roneous cases (ie the Cheonan incident) In that case we should ex-pect the no-Cheonan alternative to outperform our original model AsTable 2 shows however when we exclude all the data related to theCheonan case the resulting model loses much explanatory power com-pared to the original model Only in one category (a threat accuracy)the no-Cheonan-case alternative outperforms our original model buteven in that category the difference is ignorable (503 vs 514) In

22 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

all other categories the original model shows much stronger perfor-mances 823 vs 669 in an overall accuracy and 961 vs 787 ina non-threat accuracy Although Pyongyang has denied its involvementin the Cheonan incident the huge performance gap between the origi-nal model and the no-Cheonan alternative suggests otherwise

A different way of comparing performances of alternative models isshown in Figure 6 where the Receiver Operating Characteristic (ROC)is presented A ROC space is defined by a False Positive Rate (1 ndashSpecificity) in the x-axis and a True Positive Rate (Sensitivity) in the y-axis In a ROC space a theoretically perfect model (ie a combinationof 100 True Positive Rate with 0 False Positive Rate) appears in theupper left corner with the coordinates (0 1) In comparison a purelyrandom model such as a coin toss is found along a 45-degree diagonalline As a result good models (ie those performing better than ran-dom guesses) are found above the 45 degree line (especially close to they-axis) whereas bad models are located below it In Figure 6 the

Figure 6 Receiver operating characteristic (ROC) plot

Detecting patterns in North Korean military provocations 23

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

coordinates of our model (004 05) means that it has a False PositiveRate of 4 while its True Positive Rate is 50 By contrast the coordi-nates of alternative models in the ROC space are as follows (i) theRare Event I model with 12 ratio is (018 033) (ii) the Rare Event IImodel with 13 ratio is (001 018) (iii) the Out of Sampling (KJI KJU) model is (018 015) (iv) the Leadership Change (KJI-only)model is (021 046) (v) the Leadership Change (KJU-only) model is(042 061) (vi) the model for Conservative South Korean leadershipis (018 036) (vii) the model for Conservative South Korean leader-ship is (014 049) and (viii) the model that excludes the Cheonan caseis (021 051) As one can see all eight alternatives yield their coordi-nates either close to or lower than the diagonal line in the ROC spacewhereas our model lies above the diagonal line and close to the y-axisin the ROC space As a result we can conclude that the original modelperforms better than its potential rivals

4 Conclusion

In this article we studied articles published by the KCNA the officialnews outlet of North Korea in order to analyze patterns of its conven-tional military provocations To this end we have adopted a newmethod of automated text classification through supervised machinelearning Our model investigated the frequency of all terms appearingin KCNA articles immediately prior to five North Korean military at-tacks between 1997 and 2013 The frequency of these terms was thencompared with the frequency of terms appearing in KCNA articlespublished during peacetime without military provocations The com-parison brought to light a number of key terms ndash lsquoattack wordsrsquo so tospeak ndash whose appearances spiked in the KCNA prior to NorthKorean attacks Based on these terms our model correctly identifieseight of 10 articles as signs of imminent attacks or as peacetime newspieces

Specifically our model found five pattern-detecting terms of NorthKorean military threats lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquo lsquoJunersquo andlsquoJapanesersquo For a proper analysis of their meaning we went back to thearticles in which these five attack words appeared in order to investi-gate the contexts in which they were used by the North Korean gov-ernment Our investigation shows that in the lead-up to an attack

24 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Pyongyang displayed a strong tendency to emphasize the legacy of itsmilitary struggle against lsquoJapanesersquo colonialism and its fight against theUS imperialism during the Korean War (which began in lsquoJunersquo 1950)perhaps in an effort to heighten a domestic patriotic fervor at a timeof impending crisis In addition immediately before Pyongyang em-barks on hostile provocations the KCNA tends to increase its reprint-ing of lsquosignedrsquo commentaries from Rodong Sinmun which typically crit-icizes the United States South Korea and Japan in harsh termsprobably reflecting its deteriorating relations with the outside world

It seems that our methodology is promising to studies of interna-tional conflicts of various sorts First as shown in this article amachine-learning technology is useful in terms of detecting patternsthat distinguish threats of Pyongyang from its lsquonoisesrsquo or lsquobluffingrsquoSecond for other authoritarian regimes where there is a tight govern-ment controls over the mass media our approach can be used to detectpatterns symptoms or clues to escalating crisis Finally even for dem-ocratic countries with a free media our approach can be used on vari-ous occasions For instance a machine-learning technology can be uti-lized to analyze certain aspects of domestic politics such as signalingor communication between policymakers and the public To cite a con-crete example we can use a machine-learning technique to examinehow an aggressive foreign policy like lsquosabre rattlingrsquo affects public per-ceptions of fear or crisis in democracy Also a machine-learning ap-proach can be used to analyze external interactions of a democratic re-gime with other countries For instance we can test if there are anysignificant correlations between presidential elections in the US and thelevel of North Korean threats or between North Korean nuclear testsand varying public reactions in South Korea

As a final note it is not our contention that the model developed inthis article based on an automated machine-learning technique has de-tected intentional signals which Pyongyang sends to the outside worldimmediately prior to its military attack Instead the key attack wordswe have identified should be seen as signs or patterns that the NorthKorean regime unwittingly displays when it is inching toward a militaryoption For two reasons we doubt an intentional or signaling natureof the attack words in our model First if Pyongyang has indeed usedthe five attack words as a deliberate signal to the outside world that itis gearing up for military provocations why would it send the signal in

Detecting patterns in North Korean military provocations 25

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

such an arcane way that can be detected only through a complicatedautomated text classification process If the North Korean regime in-tends to send a signal to the outside world there is a better clearerand more visible option such as an Official Announcement by theMinistry of Foreign Affairs which Pyongyang has occasionally pub-lished in the pages of the KCNA on important issues Second if it hadbeen sending intentional signals of an impending military strike whywould the North Korean regime deny such an attack afterwards If re-peated an ex post denial would only reduce the credibility of an exante signal creating an image of North Korea as a casual bluffer Fortheir arcane nature and occasional ex post denial the five attack wordsin our model should not be treated as a premeditated signal fromPyongyang Instead they should be understood as inadvertent signs orpatterns unconsciously displayed by the North Korean governmentprior to a military attack In fact it is the unplanned nature of theseattack words that provides even more valuable information to the out-side world because it eliminates the possibility of feigned or false sig-nals from North Korea Like a pitcher who unknowingly flinches be-fore he throws a fast ball Pyongyang may unwittingly display certainpatterns before it launches a military strike

Acknowledgements

The publication of this article is supported by the National ResearchFoundation of Korea Grant funded by the Korean Government (NRF-2014S1A5A2A03065042 and NRF-2013S1A3A2055081) Previous ver-sions of this article were presented at the 2014 International StudiesAssociation Annual Convention (Toronto Canada) For questions re-garding the article please contact H Joo

References

Breslow NE (1196) lsquoStatistics in epidemiology the case-control studyrsquoJournal of American Statistical Association 91 433 14ndash28

Cha V (2010) The Sinking of the Cheonan Center for Strategic ampInternational Studies (2010 April 22) httpcsisorgpublicationsinking-cheonan (24 July 2014 date last accessed)

26 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Choe S-H (2009) Korean Navies Skirmish in DisputedWaters New YorkTimes (2009 November 10) httpwwwnytimescom20091111worldasia11koreahtml_rfrac140 (23 June 2014 date last accessed)

Ertekin S (2013) lsquoAdaptive oversampling for imbalanced data classificationrsquoin G Erol amp L Ricard (ed) Information Sciences and Systems pp261ndash269 New York Springer

Global Security (2002) The naval clash on the yellow sea on 29 June 2002 be-tween South and North Korea the situation and ROKrsquos position (2002July 1) Global Security httpwwwglobalsecurityorgwmdlibrarynewsdprk2002dprk-020701-1htm (22 July 2014 date last accessed)

Hong S (2012) lsquoNorth Korearsquos capability to conduct provocations and ROK-US capability to counter themrsquo Korea Association of Defense IndustryStudies 19 2 135ndash136

Hopkins D and King G (2010) lsquoExtracting systemic social science meaningfrom textrsquo American Journal of Political Science 54 1 229ndash47

Jung S-C (2013) Internal Instability and External Provocation Seoul KINU

Jung S-Y (2008) lsquoA study on the origin of the pueblo incident on 1968rsquoKukjejongchiyonrsquogu 11 2 179ndash207

Jurafsky D and Martin J (2009) Speech and Natural Language ProcessingAn Introduction to Natural Language Processing Computational Linguisticsand Speech Recognition NJ Prentice Hall

Kang C-G (2013) lsquoA laboratory for recursive partyrsquo Hangukcontentshakhoe11 4 23ndash32

King G and Zeng L (2001a) lsquoLogistic regression in rare events datarsquoPolitical Analysis 9 2 137ndash163

King G and Zeng L (2001b) lsquoExplaining rare events in InternationalRelationsrsquo International Organization 55 3 693ndash715

Ko M-K (2015) lsquoNorth Korean military adventurism in the late 1960s andthe changes of party-military relationsrsquo Hyondaibukhanyongu 18 3 7ndash58

Lee W-K (2014) lsquoA case study on the provocations by NKrsquo Kunsaji 91 663ndash110

Litwak R (2007) Regime Change Washington DC Johns HopkinsUniversity Press

Macfie N (2013) The Battles of the Korean West Sea (2010 November 29)Reutershttpwwwreuterscomarticle20101129us-korea-north-clashes-idUSTRE6AS1AL20101129 (5 August 2014 date last accessed)

Mearsheimer J J (2001) The Tragedy of Great Power Politics New YorkWW Norton amp Company

Moore M and Hutchison P (2010) Yeonpyeong Island A History(2010November 23) The Telegraph httpwwwtelegraphcouknewsworldnews

Detecting patterns in North Korean military provocations 27

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

asiasouthkorea8155486Yeonpyeong-Island-A-historyhtml (7 August 2014date last accessed)

Oh I-W (2011) lsquoSouth Korearsquos countermeasures against North KorearsquosArmed Provocation and Offensive dialogue proposalrsquo Trsquoongiljyollyak 111 227ndash266

Rich T (2012a) lsquoLike father like son Correlates of leaderhip in NorthKorearsquos english language newsrsquo Korea Observer 43 4 649ndash674

Rich T (2012b) lsquoDeciphering North Korearsquos nuclear rhetoric an automatedcontent analysis of KCNA newsrsquo Asian Affairs An American Review 3973ndash89

Sigal L (1998) Disarming Strangers Nuclear Diplomacy with North KoreaPrinceton Princeton University Press

Sohn J-Y (2002) South North Korea clash at sea CNN (2002 June 29)httpeditioncnncom2002WORLDasiapcfeast0629koreawarships (21July 2014 date last accessed)

Sudworth J (2010) How South Korean Ship was Sunk BBC News (2010May 20) httpwwwbbccouknews10130909 (4 August 2014 date lastaccessed)

28 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

  • lcw016-FN1
  • lcw016-FN2
  • lcw016-FN3
  • lcw016-FN4
Page 16: Detecting patterns in North Korean military provocations ...idse.or.kr/file/Whang_1.pdf · that North Korea may have because of increasing power asymmetry since the collapse of the

its supreme leader Kim Il-sung in anticipation of an impending con-flict either to boost the morale of its people or to show the outsideworld that it was determined not to back down

Second the most perplexing term in Figure 4 is lsquoJunersquo because itcan be indicative of both threat and peace Specifically the word lsquoJunersquois an indicator of peace when associated with lsquoreunificationrsquo but it canalso be a sign of a military threat when it does not appear in conjunc-tion with lsquoreunificationrsquo A careful reading of the KCNA articles thatinclude the term lsquoJunersquo suggests an explanation for this strange phe-nomenon It turns out that the word lsquoJunersquo can refer to either theKorean War (which broke out on 25 June 1950) or the historic summitbetween Kim Dae-jung and Kim Il-sung on 15 June 2000 (which waspraised as a major cornerstone for future lsquoreunificationrsquo in the officialNorth Korean press) As a result when the word lsquoJunersquo appears inclose association with lsquoreunificationrsquo it refers to the 2000 summit (orlsquothe June 15 Summitrsquo as it is called in North Korea) thus indicatingthe peaceful intentions of Pyongyang By contrast when the wordlsquoJunersquo appears alone without connection to lsquoreunificationrsquo it refers tothe Korean War thus conveying a more hostile mood

Third it is also interesting to note that the North Korean govern-ment tends to publish many lsquosignedrsquo commentaries from RodongSinmun in the KCNA immediately before it launches military provoca-tions In these lsquosignedrsquo commentaries of Rodong Sinmun Pyongyangusually criticizes foreign countries (especially the United States Japanand South Korea) in very harsh terms on various issues Accordinglythe third term lsquosignedrsquo seems to correspond to the fact that RodongSinmun the official mouthpiece of the North Korean regime barksvery loudly before it actually bites its intended target A sudden in-crease in the number of lsquosignedrsquo commentaries of Rodong Sinmun inKCNA articles appears to be a reliable sign that the North Koreangovernment may be turning to an attack mode

Finally the last indicator of an imminent threat lsquoassemblyrsquo is theeasiest term to identify but the hardest one to interpret As shown inFigure 5 the term lsquoassemblyrsquo is primarily used in reference to the SPABecause the SPA is the highest North Korean authority with regard toits relations with foreign countries it is tempting to interpret a suddenincrease in references to the SPA prior to military provocations as indi-cating that Pyongyang is attempting to openly signal its hostile

16 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

intentions to the outside world In other words belligerent messagesemanating from the SPA (and reported in KCNA articles) may revealthe determination of the North Korean regime

Although plausible the problem with such an interpretation is thata close reading of several KCNA articles using the term lsquoassemblyrsquoshows that messages from the SPA are often not dire at all For exam-ple consider the following article published on 17 November 2010 justdays before the North Korean shelling at Yonpyong Island lsquoKim YongNam President of the Presidium of the DPRK SPA sent a message ofgreetings to Qaboos Bin Said Sultan of Oman on Wednesday on theoccasion of its national holidayrsquo (KCNA 17 November 2010) As thisexample illustrates a typical KCNA article with the word lsquoassemblyrsquodescribes rather routine business of the SPA such as how it sent a mes-sage of congratulations to a foreign country how it greeted visitingdignitaries from foreign countries and so on Clearly such mundanemessages cannot be a meaningful sign of an imminent North Koreanmilitary provocation While the publication of these articles (containingthe term lsquoassemblyrsquo) might possibly correspond to an uptick inPyongyangrsquos diplomatic efforts to strengthen its ties with foreign coun-tries and increase international support prior to a military attack weneed further analyses to substantiate such a hypothesis At the sametime it also seems clear from the test that the term lsquoassemblyrsquo does not

Figure 5 Social network analysis for key words in KCNA articles

Detecting patterns in North Korean military provocations 17

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

appear randomly and is somehow linked to the timing of NorthKorean military provocations Further research is necessary to make aproper interpretation of this seemingly irrelevant yet apparently signifi-cant indicator of North Korean military threats

31 Robustness checks

Before we conclude it is necessary to address five issues regarding therobustness of our findings (i) the rare-event issue (that is due to therare nature of a North Korean military attack our sampling ratio of710 should be justified) (ii) an out-of-sample alternative (that is in-stead of building our model by using 70 of data from the five NorthKorean military strikes and then applying it to the remaining 30 ofthe data it may be better to develop a model from the first threestrikes and then apply it to the remaining two North Korean provoca-tions) and (iii) a North Korean leadership change effect (that is in-stead of elaborating a single model covering 1997 to 2013 it may makesense to develop two separate models [one for the Kim Jong-il periodand the other for the Kim Jong-un era] in order to check whether thereis a leadership change effect) (iv) a South Korean leadership changeeffect (that is it may be better to elaborate two different models [onefor the lsquoconservativersquo period during Lee Myung-bak and Park Geun-hye and the other for the lsquoprogressiversquo era during Kim Dae-jung andRoh Moo-hyun] in order to investigate whether North Korea re-sponded differently to leadership changes in South Korea and (v) ano-Cheonan alternative (that is it is worth elaborating a model whileexcluding the Cheonan naval ship case because unlike the remainingfour attacks Pyongyang has denied its involvement in sinking theCheonan In this section these five issues are addressed to check therobustness of our model

First there is a discrepancy in the ratio of threat days vs non-threatdays between the data used in our analysis and the actual frequencyAs explained earlier we have defined the threat articles as those pub-lished one week prior to each actual crisis whereas the non-threatitems are defined as those chosen randomly for a 10-day period at leasttwo months before or after actual provocations For all five crises theratio of our data is then 3550 (in days) or 710 In reality however thenumber of peace days (ie 18 years minus 35 days) is much larger than

18 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

the number of crisis days (35 days) When we count all of the peacedays that are not included in our data (ie when we take all TrueNegatives into our data) the North Korean provocations become ex-tremely rare events As a result a possible criticism is that our ap-proach does not represent the true data generating process in a sam-pling procedure While military provocations occur rarely in realityour sampling procedure exaggerates their likelihood by significantly re-ducing the number of peace days in the data Put more technically ourmodel reduces the number of False Positives while increasing the num-ber of False Negatives

Despite the criticism our sampling strategy can be justified for tworeasons First the rare-event problem is not new in political scienceFor example King and Zeng (2001ab) addressed the same problem instatistical analysis (eg logit analysis with rare events) A commonlysuggested solution is the lsquochoice-based or endogenous stratified sam-plingrsquo in econometrics and the lsquocase-control designrsquo in epidemiology(Breslow 1996) The idea is to lsquoselect within categories of Yrsquo such thatall crisis days are sampled while we use a small random fraction ofnon-crisis days In this respect our sampling strategy is consistent withsuch an approach in that we have chosen all crisis days (35) and a ran-dom fraction of peace days (50) Not surprisingly such an approach iscommonly used in supervised machine-learning as well The rare-eventissue also known as a class imbalance problem is an ongoing issue insupervised machine-learning because the size of one class is often muchlarger than that of the other class (eg premature births violent civilconflicts fraudulent credit card transactions etc) In supervisedmachine-learning two solutions have been suggested a data samplingtechnique and an algorithmic modification technique Whereas datasampling is to under-sample the majority class while over-sampling theminority class a modification technique is to combine both over- andunder-sampling methods at the same time In this article we haveadopted a data sampling technique in order to address the class imbal-ance between threat days (a minority class) and non-threat days (a ma-jority class) in North Korean military provocations

Second although reducing the number of non-crisis days in ourdata (under-sampling) while using all the crisis days (over-sampling) isconsistent with both statistical and machine-learning literature therestill remains a question How far should we reduce the number of

Detecting patterns in North Korean military provocations 19

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

peace days Will a ratio of 12 13 or an even more skewed ratio gen-erate a better performance than our model using a 710 ratio To an-swer this question we pushed the ratio until we saw a clear sign of themodel becoming too skewed towards the majority class (peace days)In our case it occurred when the ratio between the two classes ex-ceeded 13 On this subject literature on machine-learning clearlyshows that the ratio of crisis and non-crisis should not be significantlyunbalanced According to Ertekin (2013) lsquo[i]n problems where theprevalence of classes is imbalanced it is necessary to prevent the resul-tant model from being skewed towards the majority (negative) classand to ensure that the model is capable of reflecting the true nature ofthe minority (positive) classrsquo Otherwise the generated model can sufferfrom an over-fitting problem For instance in the face of an extremelyskewed dataset (eg the real ratio of our data would be 35 crisis daysvs 6535 peace days) an automated machine-learning process naturallybecomes lsquogreedyrsquo that is in order to increase its overall accuracy it in-creasingly focuses on the larger category (6535 days) while virtually ig-noring the smaller category (35 days) Generating the model in thisway would be problematic however because our object is to focus onthe minority category of crisis days which is less than 05 of all daysfrom 1997 to 2013 After all we are trying to classify upcoming NorthKorean attacks not peaceful days

With this goal in mind we ran robustness checks to see if alternativemodels yield a better explanatory power when we increased the numberof peace days vis-a-vis threat days Because the maximum limit of anunbalanced ratio for automated machine-learning is 13 we attemptedboth 12 and 13 in our tests As Table 2 shows it is clear that the orig-inal model outperforms both alternatives Compared to the 12 ratiothe original model has more explanatory power in every way (iehigher overall accuracy higher threat accuracy higher non-threat accu-racy) By contrast the 13 ratio model has a slightly lower overall accu-racy marginally higher non-threat accuracy but devastatingly lowerthreat accuracy (175) than our original model because the increasingimbalance (from 710 to 13) turned the automated machine-learningprocess to a lsquogreedyrsquo mode As a result it is our conclusion that theoriginal model uses the golden ratio with the most accurate categoriz-ing capacity

20 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Second it is also necessary to check whether an out-of-sample ap-proach is a better alternative to our original model In our analysis wedeveloped a model by using 70 of data from the five North Koreanattacks and then applied it to the remaining 30 of data from Onemay suspect however that a better approach is to elaborate a modelfrom the first three North Korean strikes (ie during the Kim Jong-ilperiod) and then apply it to the remaining two attacks (ie during theKim Jong-un era) in order to see whether it yields more explanatorypower As Table 2 shows the opposite turns out to be the case In factthe out-of-sampling model loses explanatory power in every possibleway (ie lower overall accuracy lower threat accuracy and lower non-threat accuracy) As a result our original model outperforms the out-of-sampling alternative

Third it is necessary to check whether or not there is a leadershipchange effect in North Korea Has the change of power from KimJong-il to Kim Jong-un produced such significant differences in theirleadership style that it may be worth developing two separate models(one for the Kim Jong-il period the other for the Kim Jong-un era)instead of a single model covering the entire period as we did As

Table 2 Comparison with alternative models

ModelsOverall accuracy (N)

Threat accuracy(Sensitivity)

Non-threataccuracy (Specificity)

Rare event I 724 328 823

12 Ratio (368508) (66201) (302367)

Rare event II 788 175 99

13 Ratio (656832) (36206) (620626)

Out of sampling 541 145 821

KJI KJU (6471195) (72495) (575700)

NK leadership KJI 685 (98143) 459 (2861) 793 (6582)

KJI vs KJU KJU 595 (195328) 612 (93152) 58 (102176)

SK leadership Con 614 (221360) 363 (58160) 815 (163200)

Con vs Prog Prog 681 (98144) 493 (3571) 863 (6373)

No cheonan 669 514 787

Case (273408) 92178 181230

Our Model 823 503 961

(401487) (74147) (327340)

Detecting patterns in North Korean military provocations 21

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Table 2 shows the double platform alternative yields a worse outcomeoverall In fact its only advantage is that the Kim-Jong-un-model hasslightly higher threat accuracy (612) than our original model(503) In every other aspect however the Kim-Jong-un-model per-forms much worse Moreover the Kim-Jong-il-model has a much loweraccuracy in all respects than our original model As a result it is clearthat the double platform is a worse alternative to our single platformThe recent leadership change in Pyongyang has shown little effects asfar as its military provocations are concerned

Fourth it is important to investigate whether a leadership change inSouth Korea has any impact on North Korean military provocationsHas the oscillation of power between the lsquoprogressiversquo administrations(Kim Dae-jung and Roh Moo-hyun) and the lsquoconservativersquo regimes(Lee Myung-bak and Park Geun-hye) invoked different responses fromPyongyang If so we should expect a double platform (one model forthe conservative period vs the other model for the progressive era inSouth Korea) to outperform the original single platform As Table 2shows however the two separate models in the double-platform alter-native perform much worse than the original single platform in all cat-egories an overall accuracy threat accuracy and non-threat accuracyClearly the original model is a better choice It is shown above thatthe leadership change in Pyongyang has not produced significant policyshifts in its military provocations Likewise leadership changes inSouth Korea do not seem to have invoked major policy shifts inPyongyang as far as its military provocations are concerned

Finally it is worth examining an alternative model that excludes thesinking of the Cheonan naval ship case Unlike the remaining four at-tacks in our dataset the North Korean government has consistentlydenied its involvement in the Cheonan incident If Pyongyang had notreally sunk the Cheonan as it claimed it means that our original modelis performing below its full potential because it is built upon some er-roneous cases (ie the Cheonan incident) In that case we should ex-pect the no-Cheonan alternative to outperform our original model AsTable 2 shows however when we exclude all the data related to theCheonan case the resulting model loses much explanatory power com-pared to the original model Only in one category (a threat accuracy)the no-Cheonan-case alternative outperforms our original model buteven in that category the difference is ignorable (503 vs 514) In

22 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

all other categories the original model shows much stronger perfor-mances 823 vs 669 in an overall accuracy and 961 vs 787 ina non-threat accuracy Although Pyongyang has denied its involvementin the Cheonan incident the huge performance gap between the origi-nal model and the no-Cheonan alternative suggests otherwise

A different way of comparing performances of alternative models isshown in Figure 6 where the Receiver Operating Characteristic (ROC)is presented A ROC space is defined by a False Positive Rate (1 ndashSpecificity) in the x-axis and a True Positive Rate (Sensitivity) in the y-axis In a ROC space a theoretically perfect model (ie a combinationof 100 True Positive Rate with 0 False Positive Rate) appears in theupper left corner with the coordinates (0 1) In comparison a purelyrandom model such as a coin toss is found along a 45-degree diagonalline As a result good models (ie those performing better than ran-dom guesses) are found above the 45 degree line (especially close to they-axis) whereas bad models are located below it In Figure 6 the

Figure 6 Receiver operating characteristic (ROC) plot

Detecting patterns in North Korean military provocations 23

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

coordinates of our model (004 05) means that it has a False PositiveRate of 4 while its True Positive Rate is 50 By contrast the coordi-nates of alternative models in the ROC space are as follows (i) theRare Event I model with 12 ratio is (018 033) (ii) the Rare Event IImodel with 13 ratio is (001 018) (iii) the Out of Sampling (KJI KJU) model is (018 015) (iv) the Leadership Change (KJI-only)model is (021 046) (v) the Leadership Change (KJU-only) model is(042 061) (vi) the model for Conservative South Korean leadershipis (018 036) (vii) the model for Conservative South Korean leader-ship is (014 049) and (viii) the model that excludes the Cheonan caseis (021 051) As one can see all eight alternatives yield their coordi-nates either close to or lower than the diagonal line in the ROC spacewhereas our model lies above the diagonal line and close to the y-axisin the ROC space As a result we can conclude that the original modelperforms better than its potential rivals

4 Conclusion

In this article we studied articles published by the KCNA the officialnews outlet of North Korea in order to analyze patterns of its conven-tional military provocations To this end we have adopted a newmethod of automated text classification through supervised machinelearning Our model investigated the frequency of all terms appearingin KCNA articles immediately prior to five North Korean military at-tacks between 1997 and 2013 The frequency of these terms was thencompared with the frequency of terms appearing in KCNA articlespublished during peacetime without military provocations The com-parison brought to light a number of key terms ndash lsquoattack wordsrsquo so tospeak ndash whose appearances spiked in the KCNA prior to NorthKorean attacks Based on these terms our model correctly identifieseight of 10 articles as signs of imminent attacks or as peacetime newspieces

Specifically our model found five pattern-detecting terms of NorthKorean military threats lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquo lsquoJunersquo andlsquoJapanesersquo For a proper analysis of their meaning we went back to thearticles in which these five attack words appeared in order to investi-gate the contexts in which they were used by the North Korean gov-ernment Our investigation shows that in the lead-up to an attack

24 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Pyongyang displayed a strong tendency to emphasize the legacy of itsmilitary struggle against lsquoJapanesersquo colonialism and its fight against theUS imperialism during the Korean War (which began in lsquoJunersquo 1950)perhaps in an effort to heighten a domestic patriotic fervor at a timeof impending crisis In addition immediately before Pyongyang em-barks on hostile provocations the KCNA tends to increase its reprint-ing of lsquosignedrsquo commentaries from Rodong Sinmun which typically crit-icizes the United States South Korea and Japan in harsh termsprobably reflecting its deteriorating relations with the outside world

It seems that our methodology is promising to studies of interna-tional conflicts of various sorts First as shown in this article amachine-learning technology is useful in terms of detecting patternsthat distinguish threats of Pyongyang from its lsquonoisesrsquo or lsquobluffingrsquoSecond for other authoritarian regimes where there is a tight govern-ment controls over the mass media our approach can be used to detectpatterns symptoms or clues to escalating crisis Finally even for dem-ocratic countries with a free media our approach can be used on vari-ous occasions For instance a machine-learning technology can be uti-lized to analyze certain aspects of domestic politics such as signalingor communication between policymakers and the public To cite a con-crete example we can use a machine-learning technique to examinehow an aggressive foreign policy like lsquosabre rattlingrsquo affects public per-ceptions of fear or crisis in democracy Also a machine-learning ap-proach can be used to analyze external interactions of a democratic re-gime with other countries For instance we can test if there are anysignificant correlations between presidential elections in the US and thelevel of North Korean threats or between North Korean nuclear testsand varying public reactions in South Korea

As a final note it is not our contention that the model developed inthis article based on an automated machine-learning technique has de-tected intentional signals which Pyongyang sends to the outside worldimmediately prior to its military attack Instead the key attack wordswe have identified should be seen as signs or patterns that the NorthKorean regime unwittingly displays when it is inching toward a militaryoption For two reasons we doubt an intentional or signaling natureof the attack words in our model First if Pyongyang has indeed usedthe five attack words as a deliberate signal to the outside world that itis gearing up for military provocations why would it send the signal in

Detecting patterns in North Korean military provocations 25

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

such an arcane way that can be detected only through a complicatedautomated text classification process If the North Korean regime in-tends to send a signal to the outside world there is a better clearerand more visible option such as an Official Announcement by theMinistry of Foreign Affairs which Pyongyang has occasionally pub-lished in the pages of the KCNA on important issues Second if it hadbeen sending intentional signals of an impending military strike whywould the North Korean regime deny such an attack afterwards If re-peated an ex post denial would only reduce the credibility of an exante signal creating an image of North Korea as a casual bluffer Fortheir arcane nature and occasional ex post denial the five attack wordsin our model should not be treated as a premeditated signal fromPyongyang Instead they should be understood as inadvertent signs orpatterns unconsciously displayed by the North Korean governmentprior to a military attack In fact it is the unplanned nature of theseattack words that provides even more valuable information to the out-side world because it eliminates the possibility of feigned or false sig-nals from North Korea Like a pitcher who unknowingly flinches be-fore he throws a fast ball Pyongyang may unwittingly display certainpatterns before it launches a military strike

Acknowledgements

The publication of this article is supported by the National ResearchFoundation of Korea Grant funded by the Korean Government (NRF-2014S1A5A2A03065042 and NRF-2013S1A3A2055081) Previous ver-sions of this article were presented at the 2014 International StudiesAssociation Annual Convention (Toronto Canada) For questions re-garding the article please contact H Joo

References

Breslow NE (1196) lsquoStatistics in epidemiology the case-control studyrsquoJournal of American Statistical Association 91 433 14ndash28

Cha V (2010) The Sinking of the Cheonan Center for Strategic ampInternational Studies (2010 April 22) httpcsisorgpublicationsinking-cheonan (24 July 2014 date last accessed)

26 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Choe S-H (2009) Korean Navies Skirmish in DisputedWaters New YorkTimes (2009 November 10) httpwwwnytimescom20091111worldasia11koreahtml_rfrac140 (23 June 2014 date last accessed)

Ertekin S (2013) lsquoAdaptive oversampling for imbalanced data classificationrsquoin G Erol amp L Ricard (ed) Information Sciences and Systems pp261ndash269 New York Springer

Global Security (2002) The naval clash on the yellow sea on 29 June 2002 be-tween South and North Korea the situation and ROKrsquos position (2002July 1) Global Security httpwwwglobalsecurityorgwmdlibrarynewsdprk2002dprk-020701-1htm (22 July 2014 date last accessed)

Hong S (2012) lsquoNorth Korearsquos capability to conduct provocations and ROK-US capability to counter themrsquo Korea Association of Defense IndustryStudies 19 2 135ndash136

Hopkins D and King G (2010) lsquoExtracting systemic social science meaningfrom textrsquo American Journal of Political Science 54 1 229ndash47

Jung S-C (2013) Internal Instability and External Provocation Seoul KINU

Jung S-Y (2008) lsquoA study on the origin of the pueblo incident on 1968rsquoKukjejongchiyonrsquogu 11 2 179ndash207

Jurafsky D and Martin J (2009) Speech and Natural Language ProcessingAn Introduction to Natural Language Processing Computational Linguisticsand Speech Recognition NJ Prentice Hall

Kang C-G (2013) lsquoA laboratory for recursive partyrsquo Hangukcontentshakhoe11 4 23ndash32

King G and Zeng L (2001a) lsquoLogistic regression in rare events datarsquoPolitical Analysis 9 2 137ndash163

King G and Zeng L (2001b) lsquoExplaining rare events in InternationalRelationsrsquo International Organization 55 3 693ndash715

Ko M-K (2015) lsquoNorth Korean military adventurism in the late 1960s andthe changes of party-military relationsrsquo Hyondaibukhanyongu 18 3 7ndash58

Lee W-K (2014) lsquoA case study on the provocations by NKrsquo Kunsaji 91 663ndash110

Litwak R (2007) Regime Change Washington DC Johns HopkinsUniversity Press

Macfie N (2013) The Battles of the Korean West Sea (2010 November 29)Reutershttpwwwreuterscomarticle20101129us-korea-north-clashes-idUSTRE6AS1AL20101129 (5 August 2014 date last accessed)

Mearsheimer J J (2001) The Tragedy of Great Power Politics New YorkWW Norton amp Company

Moore M and Hutchison P (2010) Yeonpyeong Island A History(2010November 23) The Telegraph httpwwwtelegraphcouknewsworldnews

Detecting patterns in North Korean military provocations 27

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

asiasouthkorea8155486Yeonpyeong-Island-A-historyhtml (7 August 2014date last accessed)

Oh I-W (2011) lsquoSouth Korearsquos countermeasures against North KorearsquosArmed Provocation and Offensive dialogue proposalrsquo Trsquoongiljyollyak 111 227ndash266

Rich T (2012a) lsquoLike father like son Correlates of leaderhip in NorthKorearsquos english language newsrsquo Korea Observer 43 4 649ndash674

Rich T (2012b) lsquoDeciphering North Korearsquos nuclear rhetoric an automatedcontent analysis of KCNA newsrsquo Asian Affairs An American Review 3973ndash89

Sigal L (1998) Disarming Strangers Nuclear Diplomacy with North KoreaPrinceton Princeton University Press

Sohn J-Y (2002) South North Korea clash at sea CNN (2002 June 29)httpeditioncnncom2002WORLDasiapcfeast0629koreawarships (21July 2014 date last accessed)

Sudworth J (2010) How South Korean Ship was Sunk BBC News (2010May 20) httpwwwbbccouknews10130909 (4 August 2014 date lastaccessed)

28 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

  • lcw016-FN1
  • lcw016-FN2
  • lcw016-FN3
  • lcw016-FN4
Page 17: Detecting patterns in North Korean military provocations ...idse.or.kr/file/Whang_1.pdf · that North Korea may have because of increasing power asymmetry since the collapse of the

intentions to the outside world In other words belligerent messagesemanating from the SPA (and reported in KCNA articles) may revealthe determination of the North Korean regime

Although plausible the problem with such an interpretation is thata close reading of several KCNA articles using the term lsquoassemblyrsquoshows that messages from the SPA are often not dire at all For exam-ple consider the following article published on 17 November 2010 justdays before the North Korean shelling at Yonpyong Island lsquoKim YongNam President of the Presidium of the DPRK SPA sent a message ofgreetings to Qaboos Bin Said Sultan of Oman on Wednesday on theoccasion of its national holidayrsquo (KCNA 17 November 2010) As thisexample illustrates a typical KCNA article with the word lsquoassemblyrsquodescribes rather routine business of the SPA such as how it sent a mes-sage of congratulations to a foreign country how it greeted visitingdignitaries from foreign countries and so on Clearly such mundanemessages cannot be a meaningful sign of an imminent North Koreanmilitary provocation While the publication of these articles (containingthe term lsquoassemblyrsquo) might possibly correspond to an uptick inPyongyangrsquos diplomatic efforts to strengthen its ties with foreign coun-tries and increase international support prior to a military attack weneed further analyses to substantiate such a hypothesis At the sametime it also seems clear from the test that the term lsquoassemblyrsquo does not

Figure 5 Social network analysis for key words in KCNA articles

Detecting patterns in North Korean military provocations 17

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

appear randomly and is somehow linked to the timing of NorthKorean military provocations Further research is necessary to make aproper interpretation of this seemingly irrelevant yet apparently signifi-cant indicator of North Korean military threats

31 Robustness checks

Before we conclude it is necessary to address five issues regarding therobustness of our findings (i) the rare-event issue (that is due to therare nature of a North Korean military attack our sampling ratio of710 should be justified) (ii) an out-of-sample alternative (that is in-stead of building our model by using 70 of data from the five NorthKorean military strikes and then applying it to the remaining 30 ofthe data it may be better to develop a model from the first threestrikes and then apply it to the remaining two North Korean provoca-tions) and (iii) a North Korean leadership change effect (that is in-stead of elaborating a single model covering 1997 to 2013 it may makesense to develop two separate models [one for the Kim Jong-il periodand the other for the Kim Jong-un era] in order to check whether thereis a leadership change effect) (iv) a South Korean leadership changeeffect (that is it may be better to elaborate two different models [onefor the lsquoconservativersquo period during Lee Myung-bak and Park Geun-hye and the other for the lsquoprogressiversquo era during Kim Dae-jung andRoh Moo-hyun] in order to investigate whether North Korea re-sponded differently to leadership changes in South Korea and (v) ano-Cheonan alternative (that is it is worth elaborating a model whileexcluding the Cheonan naval ship case because unlike the remainingfour attacks Pyongyang has denied its involvement in sinking theCheonan In this section these five issues are addressed to check therobustness of our model

First there is a discrepancy in the ratio of threat days vs non-threatdays between the data used in our analysis and the actual frequencyAs explained earlier we have defined the threat articles as those pub-lished one week prior to each actual crisis whereas the non-threatitems are defined as those chosen randomly for a 10-day period at leasttwo months before or after actual provocations For all five crises theratio of our data is then 3550 (in days) or 710 In reality however thenumber of peace days (ie 18 years minus 35 days) is much larger than

18 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

the number of crisis days (35 days) When we count all of the peacedays that are not included in our data (ie when we take all TrueNegatives into our data) the North Korean provocations become ex-tremely rare events As a result a possible criticism is that our ap-proach does not represent the true data generating process in a sam-pling procedure While military provocations occur rarely in realityour sampling procedure exaggerates their likelihood by significantly re-ducing the number of peace days in the data Put more technically ourmodel reduces the number of False Positives while increasing the num-ber of False Negatives

Despite the criticism our sampling strategy can be justified for tworeasons First the rare-event problem is not new in political scienceFor example King and Zeng (2001ab) addressed the same problem instatistical analysis (eg logit analysis with rare events) A commonlysuggested solution is the lsquochoice-based or endogenous stratified sam-plingrsquo in econometrics and the lsquocase-control designrsquo in epidemiology(Breslow 1996) The idea is to lsquoselect within categories of Yrsquo such thatall crisis days are sampled while we use a small random fraction ofnon-crisis days In this respect our sampling strategy is consistent withsuch an approach in that we have chosen all crisis days (35) and a ran-dom fraction of peace days (50) Not surprisingly such an approach iscommonly used in supervised machine-learning as well The rare-eventissue also known as a class imbalance problem is an ongoing issue insupervised machine-learning because the size of one class is often muchlarger than that of the other class (eg premature births violent civilconflicts fraudulent credit card transactions etc) In supervisedmachine-learning two solutions have been suggested a data samplingtechnique and an algorithmic modification technique Whereas datasampling is to under-sample the majority class while over-sampling theminority class a modification technique is to combine both over- andunder-sampling methods at the same time In this article we haveadopted a data sampling technique in order to address the class imbal-ance between threat days (a minority class) and non-threat days (a ma-jority class) in North Korean military provocations

Second although reducing the number of non-crisis days in ourdata (under-sampling) while using all the crisis days (over-sampling) isconsistent with both statistical and machine-learning literature therestill remains a question How far should we reduce the number of

Detecting patterns in North Korean military provocations 19

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

peace days Will a ratio of 12 13 or an even more skewed ratio gen-erate a better performance than our model using a 710 ratio To an-swer this question we pushed the ratio until we saw a clear sign of themodel becoming too skewed towards the majority class (peace days)In our case it occurred when the ratio between the two classes ex-ceeded 13 On this subject literature on machine-learning clearlyshows that the ratio of crisis and non-crisis should not be significantlyunbalanced According to Ertekin (2013) lsquo[i]n problems where theprevalence of classes is imbalanced it is necessary to prevent the resul-tant model from being skewed towards the majority (negative) classand to ensure that the model is capable of reflecting the true nature ofthe minority (positive) classrsquo Otherwise the generated model can sufferfrom an over-fitting problem For instance in the face of an extremelyskewed dataset (eg the real ratio of our data would be 35 crisis daysvs 6535 peace days) an automated machine-learning process naturallybecomes lsquogreedyrsquo that is in order to increase its overall accuracy it in-creasingly focuses on the larger category (6535 days) while virtually ig-noring the smaller category (35 days) Generating the model in thisway would be problematic however because our object is to focus onthe minority category of crisis days which is less than 05 of all daysfrom 1997 to 2013 After all we are trying to classify upcoming NorthKorean attacks not peaceful days

With this goal in mind we ran robustness checks to see if alternativemodels yield a better explanatory power when we increased the numberof peace days vis-a-vis threat days Because the maximum limit of anunbalanced ratio for automated machine-learning is 13 we attemptedboth 12 and 13 in our tests As Table 2 shows it is clear that the orig-inal model outperforms both alternatives Compared to the 12 ratiothe original model has more explanatory power in every way (iehigher overall accuracy higher threat accuracy higher non-threat accu-racy) By contrast the 13 ratio model has a slightly lower overall accu-racy marginally higher non-threat accuracy but devastatingly lowerthreat accuracy (175) than our original model because the increasingimbalance (from 710 to 13) turned the automated machine-learningprocess to a lsquogreedyrsquo mode As a result it is our conclusion that theoriginal model uses the golden ratio with the most accurate categoriz-ing capacity

20 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Second it is also necessary to check whether an out-of-sample ap-proach is a better alternative to our original model In our analysis wedeveloped a model by using 70 of data from the five North Koreanattacks and then applied it to the remaining 30 of data from Onemay suspect however that a better approach is to elaborate a modelfrom the first three North Korean strikes (ie during the Kim Jong-ilperiod) and then apply it to the remaining two attacks (ie during theKim Jong-un era) in order to see whether it yields more explanatorypower As Table 2 shows the opposite turns out to be the case In factthe out-of-sampling model loses explanatory power in every possibleway (ie lower overall accuracy lower threat accuracy and lower non-threat accuracy) As a result our original model outperforms the out-of-sampling alternative

Third it is necessary to check whether or not there is a leadershipchange effect in North Korea Has the change of power from KimJong-il to Kim Jong-un produced such significant differences in theirleadership style that it may be worth developing two separate models(one for the Kim Jong-il period the other for the Kim Jong-un era)instead of a single model covering the entire period as we did As

Table 2 Comparison with alternative models

ModelsOverall accuracy (N)

Threat accuracy(Sensitivity)

Non-threataccuracy (Specificity)

Rare event I 724 328 823

12 Ratio (368508) (66201) (302367)

Rare event II 788 175 99

13 Ratio (656832) (36206) (620626)

Out of sampling 541 145 821

KJI KJU (6471195) (72495) (575700)

NK leadership KJI 685 (98143) 459 (2861) 793 (6582)

KJI vs KJU KJU 595 (195328) 612 (93152) 58 (102176)

SK leadership Con 614 (221360) 363 (58160) 815 (163200)

Con vs Prog Prog 681 (98144) 493 (3571) 863 (6373)

No cheonan 669 514 787

Case (273408) 92178 181230

Our Model 823 503 961

(401487) (74147) (327340)

Detecting patterns in North Korean military provocations 21

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Table 2 shows the double platform alternative yields a worse outcomeoverall In fact its only advantage is that the Kim-Jong-un-model hasslightly higher threat accuracy (612) than our original model(503) In every other aspect however the Kim-Jong-un-model per-forms much worse Moreover the Kim-Jong-il-model has a much loweraccuracy in all respects than our original model As a result it is clearthat the double platform is a worse alternative to our single platformThe recent leadership change in Pyongyang has shown little effects asfar as its military provocations are concerned

Fourth it is important to investigate whether a leadership change inSouth Korea has any impact on North Korean military provocationsHas the oscillation of power between the lsquoprogressiversquo administrations(Kim Dae-jung and Roh Moo-hyun) and the lsquoconservativersquo regimes(Lee Myung-bak and Park Geun-hye) invoked different responses fromPyongyang If so we should expect a double platform (one model forthe conservative period vs the other model for the progressive era inSouth Korea) to outperform the original single platform As Table 2shows however the two separate models in the double-platform alter-native perform much worse than the original single platform in all cat-egories an overall accuracy threat accuracy and non-threat accuracyClearly the original model is a better choice It is shown above thatthe leadership change in Pyongyang has not produced significant policyshifts in its military provocations Likewise leadership changes inSouth Korea do not seem to have invoked major policy shifts inPyongyang as far as its military provocations are concerned

Finally it is worth examining an alternative model that excludes thesinking of the Cheonan naval ship case Unlike the remaining four at-tacks in our dataset the North Korean government has consistentlydenied its involvement in the Cheonan incident If Pyongyang had notreally sunk the Cheonan as it claimed it means that our original modelis performing below its full potential because it is built upon some er-roneous cases (ie the Cheonan incident) In that case we should ex-pect the no-Cheonan alternative to outperform our original model AsTable 2 shows however when we exclude all the data related to theCheonan case the resulting model loses much explanatory power com-pared to the original model Only in one category (a threat accuracy)the no-Cheonan-case alternative outperforms our original model buteven in that category the difference is ignorable (503 vs 514) In

22 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

all other categories the original model shows much stronger perfor-mances 823 vs 669 in an overall accuracy and 961 vs 787 ina non-threat accuracy Although Pyongyang has denied its involvementin the Cheonan incident the huge performance gap between the origi-nal model and the no-Cheonan alternative suggests otherwise

A different way of comparing performances of alternative models isshown in Figure 6 where the Receiver Operating Characteristic (ROC)is presented A ROC space is defined by a False Positive Rate (1 ndashSpecificity) in the x-axis and a True Positive Rate (Sensitivity) in the y-axis In a ROC space a theoretically perfect model (ie a combinationof 100 True Positive Rate with 0 False Positive Rate) appears in theupper left corner with the coordinates (0 1) In comparison a purelyrandom model such as a coin toss is found along a 45-degree diagonalline As a result good models (ie those performing better than ran-dom guesses) are found above the 45 degree line (especially close to they-axis) whereas bad models are located below it In Figure 6 the

Figure 6 Receiver operating characteristic (ROC) plot

Detecting patterns in North Korean military provocations 23

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

coordinates of our model (004 05) means that it has a False PositiveRate of 4 while its True Positive Rate is 50 By contrast the coordi-nates of alternative models in the ROC space are as follows (i) theRare Event I model with 12 ratio is (018 033) (ii) the Rare Event IImodel with 13 ratio is (001 018) (iii) the Out of Sampling (KJI KJU) model is (018 015) (iv) the Leadership Change (KJI-only)model is (021 046) (v) the Leadership Change (KJU-only) model is(042 061) (vi) the model for Conservative South Korean leadershipis (018 036) (vii) the model for Conservative South Korean leader-ship is (014 049) and (viii) the model that excludes the Cheonan caseis (021 051) As one can see all eight alternatives yield their coordi-nates either close to or lower than the diagonal line in the ROC spacewhereas our model lies above the diagonal line and close to the y-axisin the ROC space As a result we can conclude that the original modelperforms better than its potential rivals

4 Conclusion

In this article we studied articles published by the KCNA the officialnews outlet of North Korea in order to analyze patterns of its conven-tional military provocations To this end we have adopted a newmethod of automated text classification through supervised machinelearning Our model investigated the frequency of all terms appearingin KCNA articles immediately prior to five North Korean military at-tacks between 1997 and 2013 The frequency of these terms was thencompared with the frequency of terms appearing in KCNA articlespublished during peacetime without military provocations The com-parison brought to light a number of key terms ndash lsquoattack wordsrsquo so tospeak ndash whose appearances spiked in the KCNA prior to NorthKorean attacks Based on these terms our model correctly identifieseight of 10 articles as signs of imminent attacks or as peacetime newspieces

Specifically our model found five pattern-detecting terms of NorthKorean military threats lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquo lsquoJunersquo andlsquoJapanesersquo For a proper analysis of their meaning we went back to thearticles in which these five attack words appeared in order to investi-gate the contexts in which they were used by the North Korean gov-ernment Our investigation shows that in the lead-up to an attack

24 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Pyongyang displayed a strong tendency to emphasize the legacy of itsmilitary struggle against lsquoJapanesersquo colonialism and its fight against theUS imperialism during the Korean War (which began in lsquoJunersquo 1950)perhaps in an effort to heighten a domestic patriotic fervor at a timeof impending crisis In addition immediately before Pyongyang em-barks on hostile provocations the KCNA tends to increase its reprint-ing of lsquosignedrsquo commentaries from Rodong Sinmun which typically crit-icizes the United States South Korea and Japan in harsh termsprobably reflecting its deteriorating relations with the outside world

It seems that our methodology is promising to studies of interna-tional conflicts of various sorts First as shown in this article amachine-learning technology is useful in terms of detecting patternsthat distinguish threats of Pyongyang from its lsquonoisesrsquo or lsquobluffingrsquoSecond for other authoritarian regimes where there is a tight govern-ment controls over the mass media our approach can be used to detectpatterns symptoms or clues to escalating crisis Finally even for dem-ocratic countries with a free media our approach can be used on vari-ous occasions For instance a machine-learning technology can be uti-lized to analyze certain aspects of domestic politics such as signalingor communication between policymakers and the public To cite a con-crete example we can use a machine-learning technique to examinehow an aggressive foreign policy like lsquosabre rattlingrsquo affects public per-ceptions of fear or crisis in democracy Also a machine-learning ap-proach can be used to analyze external interactions of a democratic re-gime with other countries For instance we can test if there are anysignificant correlations between presidential elections in the US and thelevel of North Korean threats or between North Korean nuclear testsand varying public reactions in South Korea

As a final note it is not our contention that the model developed inthis article based on an automated machine-learning technique has de-tected intentional signals which Pyongyang sends to the outside worldimmediately prior to its military attack Instead the key attack wordswe have identified should be seen as signs or patterns that the NorthKorean regime unwittingly displays when it is inching toward a militaryoption For two reasons we doubt an intentional or signaling natureof the attack words in our model First if Pyongyang has indeed usedthe five attack words as a deliberate signal to the outside world that itis gearing up for military provocations why would it send the signal in

Detecting patterns in North Korean military provocations 25

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

such an arcane way that can be detected only through a complicatedautomated text classification process If the North Korean regime in-tends to send a signal to the outside world there is a better clearerand more visible option such as an Official Announcement by theMinistry of Foreign Affairs which Pyongyang has occasionally pub-lished in the pages of the KCNA on important issues Second if it hadbeen sending intentional signals of an impending military strike whywould the North Korean regime deny such an attack afterwards If re-peated an ex post denial would only reduce the credibility of an exante signal creating an image of North Korea as a casual bluffer Fortheir arcane nature and occasional ex post denial the five attack wordsin our model should not be treated as a premeditated signal fromPyongyang Instead they should be understood as inadvertent signs orpatterns unconsciously displayed by the North Korean governmentprior to a military attack In fact it is the unplanned nature of theseattack words that provides even more valuable information to the out-side world because it eliminates the possibility of feigned or false sig-nals from North Korea Like a pitcher who unknowingly flinches be-fore he throws a fast ball Pyongyang may unwittingly display certainpatterns before it launches a military strike

Acknowledgements

The publication of this article is supported by the National ResearchFoundation of Korea Grant funded by the Korean Government (NRF-2014S1A5A2A03065042 and NRF-2013S1A3A2055081) Previous ver-sions of this article were presented at the 2014 International StudiesAssociation Annual Convention (Toronto Canada) For questions re-garding the article please contact H Joo

References

Breslow NE (1196) lsquoStatistics in epidemiology the case-control studyrsquoJournal of American Statistical Association 91 433 14ndash28

Cha V (2010) The Sinking of the Cheonan Center for Strategic ampInternational Studies (2010 April 22) httpcsisorgpublicationsinking-cheonan (24 July 2014 date last accessed)

26 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Choe S-H (2009) Korean Navies Skirmish in DisputedWaters New YorkTimes (2009 November 10) httpwwwnytimescom20091111worldasia11koreahtml_rfrac140 (23 June 2014 date last accessed)

Ertekin S (2013) lsquoAdaptive oversampling for imbalanced data classificationrsquoin G Erol amp L Ricard (ed) Information Sciences and Systems pp261ndash269 New York Springer

Global Security (2002) The naval clash on the yellow sea on 29 June 2002 be-tween South and North Korea the situation and ROKrsquos position (2002July 1) Global Security httpwwwglobalsecurityorgwmdlibrarynewsdprk2002dprk-020701-1htm (22 July 2014 date last accessed)

Hong S (2012) lsquoNorth Korearsquos capability to conduct provocations and ROK-US capability to counter themrsquo Korea Association of Defense IndustryStudies 19 2 135ndash136

Hopkins D and King G (2010) lsquoExtracting systemic social science meaningfrom textrsquo American Journal of Political Science 54 1 229ndash47

Jung S-C (2013) Internal Instability and External Provocation Seoul KINU

Jung S-Y (2008) lsquoA study on the origin of the pueblo incident on 1968rsquoKukjejongchiyonrsquogu 11 2 179ndash207

Jurafsky D and Martin J (2009) Speech and Natural Language ProcessingAn Introduction to Natural Language Processing Computational Linguisticsand Speech Recognition NJ Prentice Hall

Kang C-G (2013) lsquoA laboratory for recursive partyrsquo Hangukcontentshakhoe11 4 23ndash32

King G and Zeng L (2001a) lsquoLogistic regression in rare events datarsquoPolitical Analysis 9 2 137ndash163

King G and Zeng L (2001b) lsquoExplaining rare events in InternationalRelationsrsquo International Organization 55 3 693ndash715

Ko M-K (2015) lsquoNorth Korean military adventurism in the late 1960s andthe changes of party-military relationsrsquo Hyondaibukhanyongu 18 3 7ndash58

Lee W-K (2014) lsquoA case study on the provocations by NKrsquo Kunsaji 91 663ndash110

Litwak R (2007) Regime Change Washington DC Johns HopkinsUniversity Press

Macfie N (2013) The Battles of the Korean West Sea (2010 November 29)Reutershttpwwwreuterscomarticle20101129us-korea-north-clashes-idUSTRE6AS1AL20101129 (5 August 2014 date last accessed)

Mearsheimer J J (2001) The Tragedy of Great Power Politics New YorkWW Norton amp Company

Moore M and Hutchison P (2010) Yeonpyeong Island A History(2010November 23) The Telegraph httpwwwtelegraphcouknewsworldnews

Detecting patterns in North Korean military provocations 27

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

asiasouthkorea8155486Yeonpyeong-Island-A-historyhtml (7 August 2014date last accessed)

Oh I-W (2011) lsquoSouth Korearsquos countermeasures against North KorearsquosArmed Provocation and Offensive dialogue proposalrsquo Trsquoongiljyollyak 111 227ndash266

Rich T (2012a) lsquoLike father like son Correlates of leaderhip in NorthKorearsquos english language newsrsquo Korea Observer 43 4 649ndash674

Rich T (2012b) lsquoDeciphering North Korearsquos nuclear rhetoric an automatedcontent analysis of KCNA newsrsquo Asian Affairs An American Review 3973ndash89

Sigal L (1998) Disarming Strangers Nuclear Diplomacy with North KoreaPrinceton Princeton University Press

Sohn J-Y (2002) South North Korea clash at sea CNN (2002 June 29)httpeditioncnncom2002WORLDasiapcfeast0629koreawarships (21July 2014 date last accessed)

Sudworth J (2010) How South Korean Ship was Sunk BBC News (2010May 20) httpwwwbbccouknews10130909 (4 August 2014 date lastaccessed)

28 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

  • lcw016-FN1
  • lcw016-FN2
  • lcw016-FN3
  • lcw016-FN4
Page 18: Detecting patterns in North Korean military provocations ...idse.or.kr/file/Whang_1.pdf · that North Korea may have because of increasing power asymmetry since the collapse of the

appear randomly and is somehow linked to the timing of NorthKorean military provocations Further research is necessary to make aproper interpretation of this seemingly irrelevant yet apparently signifi-cant indicator of North Korean military threats

31 Robustness checks

Before we conclude it is necessary to address five issues regarding therobustness of our findings (i) the rare-event issue (that is due to therare nature of a North Korean military attack our sampling ratio of710 should be justified) (ii) an out-of-sample alternative (that is in-stead of building our model by using 70 of data from the five NorthKorean military strikes and then applying it to the remaining 30 ofthe data it may be better to develop a model from the first threestrikes and then apply it to the remaining two North Korean provoca-tions) and (iii) a North Korean leadership change effect (that is in-stead of elaborating a single model covering 1997 to 2013 it may makesense to develop two separate models [one for the Kim Jong-il periodand the other for the Kim Jong-un era] in order to check whether thereis a leadership change effect) (iv) a South Korean leadership changeeffect (that is it may be better to elaborate two different models [onefor the lsquoconservativersquo period during Lee Myung-bak and Park Geun-hye and the other for the lsquoprogressiversquo era during Kim Dae-jung andRoh Moo-hyun] in order to investigate whether North Korea re-sponded differently to leadership changes in South Korea and (v) ano-Cheonan alternative (that is it is worth elaborating a model whileexcluding the Cheonan naval ship case because unlike the remainingfour attacks Pyongyang has denied its involvement in sinking theCheonan In this section these five issues are addressed to check therobustness of our model

First there is a discrepancy in the ratio of threat days vs non-threatdays between the data used in our analysis and the actual frequencyAs explained earlier we have defined the threat articles as those pub-lished one week prior to each actual crisis whereas the non-threatitems are defined as those chosen randomly for a 10-day period at leasttwo months before or after actual provocations For all five crises theratio of our data is then 3550 (in days) or 710 In reality however thenumber of peace days (ie 18 years minus 35 days) is much larger than

18 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

the number of crisis days (35 days) When we count all of the peacedays that are not included in our data (ie when we take all TrueNegatives into our data) the North Korean provocations become ex-tremely rare events As a result a possible criticism is that our ap-proach does not represent the true data generating process in a sam-pling procedure While military provocations occur rarely in realityour sampling procedure exaggerates their likelihood by significantly re-ducing the number of peace days in the data Put more technically ourmodel reduces the number of False Positives while increasing the num-ber of False Negatives

Despite the criticism our sampling strategy can be justified for tworeasons First the rare-event problem is not new in political scienceFor example King and Zeng (2001ab) addressed the same problem instatistical analysis (eg logit analysis with rare events) A commonlysuggested solution is the lsquochoice-based or endogenous stratified sam-plingrsquo in econometrics and the lsquocase-control designrsquo in epidemiology(Breslow 1996) The idea is to lsquoselect within categories of Yrsquo such thatall crisis days are sampled while we use a small random fraction ofnon-crisis days In this respect our sampling strategy is consistent withsuch an approach in that we have chosen all crisis days (35) and a ran-dom fraction of peace days (50) Not surprisingly such an approach iscommonly used in supervised machine-learning as well The rare-eventissue also known as a class imbalance problem is an ongoing issue insupervised machine-learning because the size of one class is often muchlarger than that of the other class (eg premature births violent civilconflicts fraudulent credit card transactions etc) In supervisedmachine-learning two solutions have been suggested a data samplingtechnique and an algorithmic modification technique Whereas datasampling is to under-sample the majority class while over-sampling theminority class a modification technique is to combine both over- andunder-sampling methods at the same time In this article we haveadopted a data sampling technique in order to address the class imbal-ance between threat days (a minority class) and non-threat days (a ma-jority class) in North Korean military provocations

Second although reducing the number of non-crisis days in ourdata (under-sampling) while using all the crisis days (over-sampling) isconsistent with both statistical and machine-learning literature therestill remains a question How far should we reduce the number of

Detecting patterns in North Korean military provocations 19

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

peace days Will a ratio of 12 13 or an even more skewed ratio gen-erate a better performance than our model using a 710 ratio To an-swer this question we pushed the ratio until we saw a clear sign of themodel becoming too skewed towards the majority class (peace days)In our case it occurred when the ratio between the two classes ex-ceeded 13 On this subject literature on machine-learning clearlyshows that the ratio of crisis and non-crisis should not be significantlyunbalanced According to Ertekin (2013) lsquo[i]n problems where theprevalence of classes is imbalanced it is necessary to prevent the resul-tant model from being skewed towards the majority (negative) classand to ensure that the model is capable of reflecting the true nature ofthe minority (positive) classrsquo Otherwise the generated model can sufferfrom an over-fitting problem For instance in the face of an extremelyskewed dataset (eg the real ratio of our data would be 35 crisis daysvs 6535 peace days) an automated machine-learning process naturallybecomes lsquogreedyrsquo that is in order to increase its overall accuracy it in-creasingly focuses on the larger category (6535 days) while virtually ig-noring the smaller category (35 days) Generating the model in thisway would be problematic however because our object is to focus onthe minority category of crisis days which is less than 05 of all daysfrom 1997 to 2013 After all we are trying to classify upcoming NorthKorean attacks not peaceful days

With this goal in mind we ran robustness checks to see if alternativemodels yield a better explanatory power when we increased the numberof peace days vis-a-vis threat days Because the maximum limit of anunbalanced ratio for automated machine-learning is 13 we attemptedboth 12 and 13 in our tests As Table 2 shows it is clear that the orig-inal model outperforms both alternatives Compared to the 12 ratiothe original model has more explanatory power in every way (iehigher overall accuracy higher threat accuracy higher non-threat accu-racy) By contrast the 13 ratio model has a slightly lower overall accu-racy marginally higher non-threat accuracy but devastatingly lowerthreat accuracy (175) than our original model because the increasingimbalance (from 710 to 13) turned the automated machine-learningprocess to a lsquogreedyrsquo mode As a result it is our conclusion that theoriginal model uses the golden ratio with the most accurate categoriz-ing capacity

20 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Second it is also necessary to check whether an out-of-sample ap-proach is a better alternative to our original model In our analysis wedeveloped a model by using 70 of data from the five North Koreanattacks and then applied it to the remaining 30 of data from Onemay suspect however that a better approach is to elaborate a modelfrom the first three North Korean strikes (ie during the Kim Jong-ilperiod) and then apply it to the remaining two attacks (ie during theKim Jong-un era) in order to see whether it yields more explanatorypower As Table 2 shows the opposite turns out to be the case In factthe out-of-sampling model loses explanatory power in every possibleway (ie lower overall accuracy lower threat accuracy and lower non-threat accuracy) As a result our original model outperforms the out-of-sampling alternative

Third it is necessary to check whether or not there is a leadershipchange effect in North Korea Has the change of power from KimJong-il to Kim Jong-un produced such significant differences in theirleadership style that it may be worth developing two separate models(one for the Kim Jong-il period the other for the Kim Jong-un era)instead of a single model covering the entire period as we did As

Table 2 Comparison with alternative models

ModelsOverall accuracy (N)

Threat accuracy(Sensitivity)

Non-threataccuracy (Specificity)

Rare event I 724 328 823

12 Ratio (368508) (66201) (302367)

Rare event II 788 175 99

13 Ratio (656832) (36206) (620626)

Out of sampling 541 145 821

KJI KJU (6471195) (72495) (575700)

NK leadership KJI 685 (98143) 459 (2861) 793 (6582)

KJI vs KJU KJU 595 (195328) 612 (93152) 58 (102176)

SK leadership Con 614 (221360) 363 (58160) 815 (163200)

Con vs Prog Prog 681 (98144) 493 (3571) 863 (6373)

No cheonan 669 514 787

Case (273408) 92178 181230

Our Model 823 503 961

(401487) (74147) (327340)

Detecting patterns in North Korean military provocations 21

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Table 2 shows the double platform alternative yields a worse outcomeoverall In fact its only advantage is that the Kim-Jong-un-model hasslightly higher threat accuracy (612) than our original model(503) In every other aspect however the Kim-Jong-un-model per-forms much worse Moreover the Kim-Jong-il-model has a much loweraccuracy in all respects than our original model As a result it is clearthat the double platform is a worse alternative to our single platformThe recent leadership change in Pyongyang has shown little effects asfar as its military provocations are concerned

Fourth it is important to investigate whether a leadership change inSouth Korea has any impact on North Korean military provocationsHas the oscillation of power between the lsquoprogressiversquo administrations(Kim Dae-jung and Roh Moo-hyun) and the lsquoconservativersquo regimes(Lee Myung-bak and Park Geun-hye) invoked different responses fromPyongyang If so we should expect a double platform (one model forthe conservative period vs the other model for the progressive era inSouth Korea) to outperform the original single platform As Table 2shows however the two separate models in the double-platform alter-native perform much worse than the original single platform in all cat-egories an overall accuracy threat accuracy and non-threat accuracyClearly the original model is a better choice It is shown above thatthe leadership change in Pyongyang has not produced significant policyshifts in its military provocations Likewise leadership changes inSouth Korea do not seem to have invoked major policy shifts inPyongyang as far as its military provocations are concerned

Finally it is worth examining an alternative model that excludes thesinking of the Cheonan naval ship case Unlike the remaining four at-tacks in our dataset the North Korean government has consistentlydenied its involvement in the Cheonan incident If Pyongyang had notreally sunk the Cheonan as it claimed it means that our original modelis performing below its full potential because it is built upon some er-roneous cases (ie the Cheonan incident) In that case we should ex-pect the no-Cheonan alternative to outperform our original model AsTable 2 shows however when we exclude all the data related to theCheonan case the resulting model loses much explanatory power com-pared to the original model Only in one category (a threat accuracy)the no-Cheonan-case alternative outperforms our original model buteven in that category the difference is ignorable (503 vs 514) In

22 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

all other categories the original model shows much stronger perfor-mances 823 vs 669 in an overall accuracy and 961 vs 787 ina non-threat accuracy Although Pyongyang has denied its involvementin the Cheonan incident the huge performance gap between the origi-nal model and the no-Cheonan alternative suggests otherwise

A different way of comparing performances of alternative models isshown in Figure 6 where the Receiver Operating Characteristic (ROC)is presented A ROC space is defined by a False Positive Rate (1 ndashSpecificity) in the x-axis and a True Positive Rate (Sensitivity) in the y-axis In a ROC space a theoretically perfect model (ie a combinationof 100 True Positive Rate with 0 False Positive Rate) appears in theupper left corner with the coordinates (0 1) In comparison a purelyrandom model such as a coin toss is found along a 45-degree diagonalline As a result good models (ie those performing better than ran-dom guesses) are found above the 45 degree line (especially close to they-axis) whereas bad models are located below it In Figure 6 the

Figure 6 Receiver operating characteristic (ROC) plot

Detecting patterns in North Korean military provocations 23

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

coordinates of our model (004 05) means that it has a False PositiveRate of 4 while its True Positive Rate is 50 By contrast the coordi-nates of alternative models in the ROC space are as follows (i) theRare Event I model with 12 ratio is (018 033) (ii) the Rare Event IImodel with 13 ratio is (001 018) (iii) the Out of Sampling (KJI KJU) model is (018 015) (iv) the Leadership Change (KJI-only)model is (021 046) (v) the Leadership Change (KJU-only) model is(042 061) (vi) the model for Conservative South Korean leadershipis (018 036) (vii) the model for Conservative South Korean leader-ship is (014 049) and (viii) the model that excludes the Cheonan caseis (021 051) As one can see all eight alternatives yield their coordi-nates either close to or lower than the diagonal line in the ROC spacewhereas our model lies above the diagonal line and close to the y-axisin the ROC space As a result we can conclude that the original modelperforms better than its potential rivals

4 Conclusion

In this article we studied articles published by the KCNA the officialnews outlet of North Korea in order to analyze patterns of its conven-tional military provocations To this end we have adopted a newmethod of automated text classification through supervised machinelearning Our model investigated the frequency of all terms appearingin KCNA articles immediately prior to five North Korean military at-tacks between 1997 and 2013 The frequency of these terms was thencompared with the frequency of terms appearing in KCNA articlespublished during peacetime without military provocations The com-parison brought to light a number of key terms ndash lsquoattack wordsrsquo so tospeak ndash whose appearances spiked in the KCNA prior to NorthKorean attacks Based on these terms our model correctly identifieseight of 10 articles as signs of imminent attacks or as peacetime newspieces

Specifically our model found five pattern-detecting terms of NorthKorean military threats lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquo lsquoJunersquo andlsquoJapanesersquo For a proper analysis of their meaning we went back to thearticles in which these five attack words appeared in order to investi-gate the contexts in which they were used by the North Korean gov-ernment Our investigation shows that in the lead-up to an attack

24 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Pyongyang displayed a strong tendency to emphasize the legacy of itsmilitary struggle against lsquoJapanesersquo colonialism and its fight against theUS imperialism during the Korean War (which began in lsquoJunersquo 1950)perhaps in an effort to heighten a domestic patriotic fervor at a timeof impending crisis In addition immediately before Pyongyang em-barks on hostile provocations the KCNA tends to increase its reprint-ing of lsquosignedrsquo commentaries from Rodong Sinmun which typically crit-icizes the United States South Korea and Japan in harsh termsprobably reflecting its deteriorating relations with the outside world

It seems that our methodology is promising to studies of interna-tional conflicts of various sorts First as shown in this article amachine-learning technology is useful in terms of detecting patternsthat distinguish threats of Pyongyang from its lsquonoisesrsquo or lsquobluffingrsquoSecond for other authoritarian regimes where there is a tight govern-ment controls over the mass media our approach can be used to detectpatterns symptoms or clues to escalating crisis Finally even for dem-ocratic countries with a free media our approach can be used on vari-ous occasions For instance a machine-learning technology can be uti-lized to analyze certain aspects of domestic politics such as signalingor communication between policymakers and the public To cite a con-crete example we can use a machine-learning technique to examinehow an aggressive foreign policy like lsquosabre rattlingrsquo affects public per-ceptions of fear or crisis in democracy Also a machine-learning ap-proach can be used to analyze external interactions of a democratic re-gime with other countries For instance we can test if there are anysignificant correlations between presidential elections in the US and thelevel of North Korean threats or between North Korean nuclear testsand varying public reactions in South Korea

As a final note it is not our contention that the model developed inthis article based on an automated machine-learning technique has de-tected intentional signals which Pyongyang sends to the outside worldimmediately prior to its military attack Instead the key attack wordswe have identified should be seen as signs or patterns that the NorthKorean regime unwittingly displays when it is inching toward a militaryoption For two reasons we doubt an intentional or signaling natureof the attack words in our model First if Pyongyang has indeed usedthe five attack words as a deliberate signal to the outside world that itis gearing up for military provocations why would it send the signal in

Detecting patterns in North Korean military provocations 25

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

such an arcane way that can be detected only through a complicatedautomated text classification process If the North Korean regime in-tends to send a signal to the outside world there is a better clearerand more visible option such as an Official Announcement by theMinistry of Foreign Affairs which Pyongyang has occasionally pub-lished in the pages of the KCNA on important issues Second if it hadbeen sending intentional signals of an impending military strike whywould the North Korean regime deny such an attack afterwards If re-peated an ex post denial would only reduce the credibility of an exante signal creating an image of North Korea as a casual bluffer Fortheir arcane nature and occasional ex post denial the five attack wordsin our model should not be treated as a premeditated signal fromPyongyang Instead they should be understood as inadvertent signs orpatterns unconsciously displayed by the North Korean governmentprior to a military attack In fact it is the unplanned nature of theseattack words that provides even more valuable information to the out-side world because it eliminates the possibility of feigned or false sig-nals from North Korea Like a pitcher who unknowingly flinches be-fore he throws a fast ball Pyongyang may unwittingly display certainpatterns before it launches a military strike

Acknowledgements

The publication of this article is supported by the National ResearchFoundation of Korea Grant funded by the Korean Government (NRF-2014S1A5A2A03065042 and NRF-2013S1A3A2055081) Previous ver-sions of this article were presented at the 2014 International StudiesAssociation Annual Convention (Toronto Canada) For questions re-garding the article please contact H Joo

References

Breslow NE (1196) lsquoStatistics in epidemiology the case-control studyrsquoJournal of American Statistical Association 91 433 14ndash28

Cha V (2010) The Sinking of the Cheonan Center for Strategic ampInternational Studies (2010 April 22) httpcsisorgpublicationsinking-cheonan (24 July 2014 date last accessed)

26 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Choe S-H (2009) Korean Navies Skirmish in DisputedWaters New YorkTimes (2009 November 10) httpwwwnytimescom20091111worldasia11koreahtml_rfrac140 (23 June 2014 date last accessed)

Ertekin S (2013) lsquoAdaptive oversampling for imbalanced data classificationrsquoin G Erol amp L Ricard (ed) Information Sciences and Systems pp261ndash269 New York Springer

Global Security (2002) The naval clash on the yellow sea on 29 June 2002 be-tween South and North Korea the situation and ROKrsquos position (2002July 1) Global Security httpwwwglobalsecurityorgwmdlibrarynewsdprk2002dprk-020701-1htm (22 July 2014 date last accessed)

Hong S (2012) lsquoNorth Korearsquos capability to conduct provocations and ROK-US capability to counter themrsquo Korea Association of Defense IndustryStudies 19 2 135ndash136

Hopkins D and King G (2010) lsquoExtracting systemic social science meaningfrom textrsquo American Journal of Political Science 54 1 229ndash47

Jung S-C (2013) Internal Instability and External Provocation Seoul KINU

Jung S-Y (2008) lsquoA study on the origin of the pueblo incident on 1968rsquoKukjejongchiyonrsquogu 11 2 179ndash207

Jurafsky D and Martin J (2009) Speech and Natural Language ProcessingAn Introduction to Natural Language Processing Computational Linguisticsand Speech Recognition NJ Prentice Hall

Kang C-G (2013) lsquoA laboratory for recursive partyrsquo Hangukcontentshakhoe11 4 23ndash32

King G and Zeng L (2001a) lsquoLogistic regression in rare events datarsquoPolitical Analysis 9 2 137ndash163

King G and Zeng L (2001b) lsquoExplaining rare events in InternationalRelationsrsquo International Organization 55 3 693ndash715

Ko M-K (2015) lsquoNorth Korean military adventurism in the late 1960s andthe changes of party-military relationsrsquo Hyondaibukhanyongu 18 3 7ndash58

Lee W-K (2014) lsquoA case study on the provocations by NKrsquo Kunsaji 91 663ndash110

Litwak R (2007) Regime Change Washington DC Johns HopkinsUniversity Press

Macfie N (2013) The Battles of the Korean West Sea (2010 November 29)Reutershttpwwwreuterscomarticle20101129us-korea-north-clashes-idUSTRE6AS1AL20101129 (5 August 2014 date last accessed)

Mearsheimer J J (2001) The Tragedy of Great Power Politics New YorkWW Norton amp Company

Moore M and Hutchison P (2010) Yeonpyeong Island A History(2010November 23) The Telegraph httpwwwtelegraphcouknewsworldnews

Detecting patterns in North Korean military provocations 27

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

asiasouthkorea8155486Yeonpyeong-Island-A-historyhtml (7 August 2014date last accessed)

Oh I-W (2011) lsquoSouth Korearsquos countermeasures against North KorearsquosArmed Provocation and Offensive dialogue proposalrsquo Trsquoongiljyollyak 111 227ndash266

Rich T (2012a) lsquoLike father like son Correlates of leaderhip in NorthKorearsquos english language newsrsquo Korea Observer 43 4 649ndash674

Rich T (2012b) lsquoDeciphering North Korearsquos nuclear rhetoric an automatedcontent analysis of KCNA newsrsquo Asian Affairs An American Review 3973ndash89

Sigal L (1998) Disarming Strangers Nuclear Diplomacy with North KoreaPrinceton Princeton University Press

Sohn J-Y (2002) South North Korea clash at sea CNN (2002 June 29)httpeditioncnncom2002WORLDasiapcfeast0629koreawarships (21July 2014 date last accessed)

Sudworth J (2010) How South Korean Ship was Sunk BBC News (2010May 20) httpwwwbbccouknews10130909 (4 August 2014 date lastaccessed)

28 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

  • lcw016-FN1
  • lcw016-FN2
  • lcw016-FN3
  • lcw016-FN4
Page 19: Detecting patterns in North Korean military provocations ...idse.or.kr/file/Whang_1.pdf · that North Korea may have because of increasing power asymmetry since the collapse of the

the number of crisis days (35 days) When we count all of the peacedays that are not included in our data (ie when we take all TrueNegatives into our data) the North Korean provocations become ex-tremely rare events As a result a possible criticism is that our ap-proach does not represent the true data generating process in a sam-pling procedure While military provocations occur rarely in realityour sampling procedure exaggerates their likelihood by significantly re-ducing the number of peace days in the data Put more technically ourmodel reduces the number of False Positives while increasing the num-ber of False Negatives

Despite the criticism our sampling strategy can be justified for tworeasons First the rare-event problem is not new in political scienceFor example King and Zeng (2001ab) addressed the same problem instatistical analysis (eg logit analysis with rare events) A commonlysuggested solution is the lsquochoice-based or endogenous stratified sam-plingrsquo in econometrics and the lsquocase-control designrsquo in epidemiology(Breslow 1996) The idea is to lsquoselect within categories of Yrsquo such thatall crisis days are sampled while we use a small random fraction ofnon-crisis days In this respect our sampling strategy is consistent withsuch an approach in that we have chosen all crisis days (35) and a ran-dom fraction of peace days (50) Not surprisingly such an approach iscommonly used in supervised machine-learning as well The rare-eventissue also known as a class imbalance problem is an ongoing issue insupervised machine-learning because the size of one class is often muchlarger than that of the other class (eg premature births violent civilconflicts fraudulent credit card transactions etc) In supervisedmachine-learning two solutions have been suggested a data samplingtechnique and an algorithmic modification technique Whereas datasampling is to under-sample the majority class while over-sampling theminority class a modification technique is to combine both over- andunder-sampling methods at the same time In this article we haveadopted a data sampling technique in order to address the class imbal-ance between threat days (a minority class) and non-threat days (a ma-jority class) in North Korean military provocations

Second although reducing the number of non-crisis days in ourdata (under-sampling) while using all the crisis days (over-sampling) isconsistent with both statistical and machine-learning literature therestill remains a question How far should we reduce the number of

Detecting patterns in North Korean military provocations 19

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

peace days Will a ratio of 12 13 or an even more skewed ratio gen-erate a better performance than our model using a 710 ratio To an-swer this question we pushed the ratio until we saw a clear sign of themodel becoming too skewed towards the majority class (peace days)In our case it occurred when the ratio between the two classes ex-ceeded 13 On this subject literature on machine-learning clearlyshows that the ratio of crisis and non-crisis should not be significantlyunbalanced According to Ertekin (2013) lsquo[i]n problems where theprevalence of classes is imbalanced it is necessary to prevent the resul-tant model from being skewed towards the majority (negative) classand to ensure that the model is capable of reflecting the true nature ofthe minority (positive) classrsquo Otherwise the generated model can sufferfrom an over-fitting problem For instance in the face of an extremelyskewed dataset (eg the real ratio of our data would be 35 crisis daysvs 6535 peace days) an automated machine-learning process naturallybecomes lsquogreedyrsquo that is in order to increase its overall accuracy it in-creasingly focuses on the larger category (6535 days) while virtually ig-noring the smaller category (35 days) Generating the model in thisway would be problematic however because our object is to focus onthe minority category of crisis days which is less than 05 of all daysfrom 1997 to 2013 After all we are trying to classify upcoming NorthKorean attacks not peaceful days

With this goal in mind we ran robustness checks to see if alternativemodels yield a better explanatory power when we increased the numberof peace days vis-a-vis threat days Because the maximum limit of anunbalanced ratio for automated machine-learning is 13 we attemptedboth 12 and 13 in our tests As Table 2 shows it is clear that the orig-inal model outperforms both alternatives Compared to the 12 ratiothe original model has more explanatory power in every way (iehigher overall accuracy higher threat accuracy higher non-threat accu-racy) By contrast the 13 ratio model has a slightly lower overall accu-racy marginally higher non-threat accuracy but devastatingly lowerthreat accuracy (175) than our original model because the increasingimbalance (from 710 to 13) turned the automated machine-learningprocess to a lsquogreedyrsquo mode As a result it is our conclusion that theoriginal model uses the golden ratio with the most accurate categoriz-ing capacity

20 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Second it is also necessary to check whether an out-of-sample ap-proach is a better alternative to our original model In our analysis wedeveloped a model by using 70 of data from the five North Koreanattacks and then applied it to the remaining 30 of data from Onemay suspect however that a better approach is to elaborate a modelfrom the first three North Korean strikes (ie during the Kim Jong-ilperiod) and then apply it to the remaining two attacks (ie during theKim Jong-un era) in order to see whether it yields more explanatorypower As Table 2 shows the opposite turns out to be the case In factthe out-of-sampling model loses explanatory power in every possibleway (ie lower overall accuracy lower threat accuracy and lower non-threat accuracy) As a result our original model outperforms the out-of-sampling alternative

Third it is necessary to check whether or not there is a leadershipchange effect in North Korea Has the change of power from KimJong-il to Kim Jong-un produced such significant differences in theirleadership style that it may be worth developing two separate models(one for the Kim Jong-il period the other for the Kim Jong-un era)instead of a single model covering the entire period as we did As

Table 2 Comparison with alternative models

ModelsOverall accuracy (N)

Threat accuracy(Sensitivity)

Non-threataccuracy (Specificity)

Rare event I 724 328 823

12 Ratio (368508) (66201) (302367)

Rare event II 788 175 99

13 Ratio (656832) (36206) (620626)

Out of sampling 541 145 821

KJI KJU (6471195) (72495) (575700)

NK leadership KJI 685 (98143) 459 (2861) 793 (6582)

KJI vs KJU KJU 595 (195328) 612 (93152) 58 (102176)

SK leadership Con 614 (221360) 363 (58160) 815 (163200)

Con vs Prog Prog 681 (98144) 493 (3571) 863 (6373)

No cheonan 669 514 787

Case (273408) 92178 181230

Our Model 823 503 961

(401487) (74147) (327340)

Detecting patterns in North Korean military provocations 21

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Table 2 shows the double platform alternative yields a worse outcomeoverall In fact its only advantage is that the Kim-Jong-un-model hasslightly higher threat accuracy (612) than our original model(503) In every other aspect however the Kim-Jong-un-model per-forms much worse Moreover the Kim-Jong-il-model has a much loweraccuracy in all respects than our original model As a result it is clearthat the double platform is a worse alternative to our single platformThe recent leadership change in Pyongyang has shown little effects asfar as its military provocations are concerned

Fourth it is important to investigate whether a leadership change inSouth Korea has any impact on North Korean military provocationsHas the oscillation of power between the lsquoprogressiversquo administrations(Kim Dae-jung and Roh Moo-hyun) and the lsquoconservativersquo regimes(Lee Myung-bak and Park Geun-hye) invoked different responses fromPyongyang If so we should expect a double platform (one model forthe conservative period vs the other model for the progressive era inSouth Korea) to outperform the original single platform As Table 2shows however the two separate models in the double-platform alter-native perform much worse than the original single platform in all cat-egories an overall accuracy threat accuracy and non-threat accuracyClearly the original model is a better choice It is shown above thatthe leadership change in Pyongyang has not produced significant policyshifts in its military provocations Likewise leadership changes inSouth Korea do not seem to have invoked major policy shifts inPyongyang as far as its military provocations are concerned

Finally it is worth examining an alternative model that excludes thesinking of the Cheonan naval ship case Unlike the remaining four at-tacks in our dataset the North Korean government has consistentlydenied its involvement in the Cheonan incident If Pyongyang had notreally sunk the Cheonan as it claimed it means that our original modelis performing below its full potential because it is built upon some er-roneous cases (ie the Cheonan incident) In that case we should ex-pect the no-Cheonan alternative to outperform our original model AsTable 2 shows however when we exclude all the data related to theCheonan case the resulting model loses much explanatory power com-pared to the original model Only in one category (a threat accuracy)the no-Cheonan-case alternative outperforms our original model buteven in that category the difference is ignorable (503 vs 514) In

22 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

all other categories the original model shows much stronger perfor-mances 823 vs 669 in an overall accuracy and 961 vs 787 ina non-threat accuracy Although Pyongyang has denied its involvementin the Cheonan incident the huge performance gap between the origi-nal model and the no-Cheonan alternative suggests otherwise

A different way of comparing performances of alternative models isshown in Figure 6 where the Receiver Operating Characteristic (ROC)is presented A ROC space is defined by a False Positive Rate (1 ndashSpecificity) in the x-axis and a True Positive Rate (Sensitivity) in the y-axis In a ROC space a theoretically perfect model (ie a combinationof 100 True Positive Rate with 0 False Positive Rate) appears in theupper left corner with the coordinates (0 1) In comparison a purelyrandom model such as a coin toss is found along a 45-degree diagonalline As a result good models (ie those performing better than ran-dom guesses) are found above the 45 degree line (especially close to they-axis) whereas bad models are located below it In Figure 6 the

Figure 6 Receiver operating characteristic (ROC) plot

Detecting patterns in North Korean military provocations 23

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

coordinates of our model (004 05) means that it has a False PositiveRate of 4 while its True Positive Rate is 50 By contrast the coordi-nates of alternative models in the ROC space are as follows (i) theRare Event I model with 12 ratio is (018 033) (ii) the Rare Event IImodel with 13 ratio is (001 018) (iii) the Out of Sampling (KJI KJU) model is (018 015) (iv) the Leadership Change (KJI-only)model is (021 046) (v) the Leadership Change (KJU-only) model is(042 061) (vi) the model for Conservative South Korean leadershipis (018 036) (vii) the model for Conservative South Korean leader-ship is (014 049) and (viii) the model that excludes the Cheonan caseis (021 051) As one can see all eight alternatives yield their coordi-nates either close to or lower than the diagonal line in the ROC spacewhereas our model lies above the diagonal line and close to the y-axisin the ROC space As a result we can conclude that the original modelperforms better than its potential rivals

4 Conclusion

In this article we studied articles published by the KCNA the officialnews outlet of North Korea in order to analyze patterns of its conven-tional military provocations To this end we have adopted a newmethod of automated text classification through supervised machinelearning Our model investigated the frequency of all terms appearingin KCNA articles immediately prior to five North Korean military at-tacks between 1997 and 2013 The frequency of these terms was thencompared with the frequency of terms appearing in KCNA articlespublished during peacetime without military provocations The com-parison brought to light a number of key terms ndash lsquoattack wordsrsquo so tospeak ndash whose appearances spiked in the KCNA prior to NorthKorean attacks Based on these terms our model correctly identifieseight of 10 articles as signs of imminent attacks or as peacetime newspieces

Specifically our model found five pattern-detecting terms of NorthKorean military threats lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquo lsquoJunersquo andlsquoJapanesersquo For a proper analysis of their meaning we went back to thearticles in which these five attack words appeared in order to investi-gate the contexts in which they were used by the North Korean gov-ernment Our investigation shows that in the lead-up to an attack

24 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Pyongyang displayed a strong tendency to emphasize the legacy of itsmilitary struggle against lsquoJapanesersquo colonialism and its fight against theUS imperialism during the Korean War (which began in lsquoJunersquo 1950)perhaps in an effort to heighten a domestic patriotic fervor at a timeof impending crisis In addition immediately before Pyongyang em-barks on hostile provocations the KCNA tends to increase its reprint-ing of lsquosignedrsquo commentaries from Rodong Sinmun which typically crit-icizes the United States South Korea and Japan in harsh termsprobably reflecting its deteriorating relations with the outside world

It seems that our methodology is promising to studies of interna-tional conflicts of various sorts First as shown in this article amachine-learning technology is useful in terms of detecting patternsthat distinguish threats of Pyongyang from its lsquonoisesrsquo or lsquobluffingrsquoSecond for other authoritarian regimes where there is a tight govern-ment controls over the mass media our approach can be used to detectpatterns symptoms or clues to escalating crisis Finally even for dem-ocratic countries with a free media our approach can be used on vari-ous occasions For instance a machine-learning technology can be uti-lized to analyze certain aspects of domestic politics such as signalingor communication between policymakers and the public To cite a con-crete example we can use a machine-learning technique to examinehow an aggressive foreign policy like lsquosabre rattlingrsquo affects public per-ceptions of fear or crisis in democracy Also a machine-learning ap-proach can be used to analyze external interactions of a democratic re-gime with other countries For instance we can test if there are anysignificant correlations between presidential elections in the US and thelevel of North Korean threats or between North Korean nuclear testsand varying public reactions in South Korea

As a final note it is not our contention that the model developed inthis article based on an automated machine-learning technique has de-tected intentional signals which Pyongyang sends to the outside worldimmediately prior to its military attack Instead the key attack wordswe have identified should be seen as signs or patterns that the NorthKorean regime unwittingly displays when it is inching toward a militaryoption For two reasons we doubt an intentional or signaling natureof the attack words in our model First if Pyongyang has indeed usedthe five attack words as a deliberate signal to the outside world that itis gearing up for military provocations why would it send the signal in

Detecting patterns in North Korean military provocations 25

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

such an arcane way that can be detected only through a complicatedautomated text classification process If the North Korean regime in-tends to send a signal to the outside world there is a better clearerand more visible option such as an Official Announcement by theMinistry of Foreign Affairs which Pyongyang has occasionally pub-lished in the pages of the KCNA on important issues Second if it hadbeen sending intentional signals of an impending military strike whywould the North Korean regime deny such an attack afterwards If re-peated an ex post denial would only reduce the credibility of an exante signal creating an image of North Korea as a casual bluffer Fortheir arcane nature and occasional ex post denial the five attack wordsin our model should not be treated as a premeditated signal fromPyongyang Instead they should be understood as inadvertent signs orpatterns unconsciously displayed by the North Korean governmentprior to a military attack In fact it is the unplanned nature of theseattack words that provides even more valuable information to the out-side world because it eliminates the possibility of feigned or false sig-nals from North Korea Like a pitcher who unknowingly flinches be-fore he throws a fast ball Pyongyang may unwittingly display certainpatterns before it launches a military strike

Acknowledgements

The publication of this article is supported by the National ResearchFoundation of Korea Grant funded by the Korean Government (NRF-2014S1A5A2A03065042 and NRF-2013S1A3A2055081) Previous ver-sions of this article were presented at the 2014 International StudiesAssociation Annual Convention (Toronto Canada) For questions re-garding the article please contact H Joo

References

Breslow NE (1196) lsquoStatistics in epidemiology the case-control studyrsquoJournal of American Statistical Association 91 433 14ndash28

Cha V (2010) The Sinking of the Cheonan Center for Strategic ampInternational Studies (2010 April 22) httpcsisorgpublicationsinking-cheonan (24 July 2014 date last accessed)

26 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Choe S-H (2009) Korean Navies Skirmish in DisputedWaters New YorkTimes (2009 November 10) httpwwwnytimescom20091111worldasia11koreahtml_rfrac140 (23 June 2014 date last accessed)

Ertekin S (2013) lsquoAdaptive oversampling for imbalanced data classificationrsquoin G Erol amp L Ricard (ed) Information Sciences and Systems pp261ndash269 New York Springer

Global Security (2002) The naval clash on the yellow sea on 29 June 2002 be-tween South and North Korea the situation and ROKrsquos position (2002July 1) Global Security httpwwwglobalsecurityorgwmdlibrarynewsdprk2002dprk-020701-1htm (22 July 2014 date last accessed)

Hong S (2012) lsquoNorth Korearsquos capability to conduct provocations and ROK-US capability to counter themrsquo Korea Association of Defense IndustryStudies 19 2 135ndash136

Hopkins D and King G (2010) lsquoExtracting systemic social science meaningfrom textrsquo American Journal of Political Science 54 1 229ndash47

Jung S-C (2013) Internal Instability and External Provocation Seoul KINU

Jung S-Y (2008) lsquoA study on the origin of the pueblo incident on 1968rsquoKukjejongchiyonrsquogu 11 2 179ndash207

Jurafsky D and Martin J (2009) Speech and Natural Language ProcessingAn Introduction to Natural Language Processing Computational Linguisticsand Speech Recognition NJ Prentice Hall

Kang C-G (2013) lsquoA laboratory for recursive partyrsquo Hangukcontentshakhoe11 4 23ndash32

King G and Zeng L (2001a) lsquoLogistic regression in rare events datarsquoPolitical Analysis 9 2 137ndash163

King G and Zeng L (2001b) lsquoExplaining rare events in InternationalRelationsrsquo International Organization 55 3 693ndash715

Ko M-K (2015) lsquoNorth Korean military adventurism in the late 1960s andthe changes of party-military relationsrsquo Hyondaibukhanyongu 18 3 7ndash58

Lee W-K (2014) lsquoA case study on the provocations by NKrsquo Kunsaji 91 663ndash110

Litwak R (2007) Regime Change Washington DC Johns HopkinsUniversity Press

Macfie N (2013) The Battles of the Korean West Sea (2010 November 29)Reutershttpwwwreuterscomarticle20101129us-korea-north-clashes-idUSTRE6AS1AL20101129 (5 August 2014 date last accessed)

Mearsheimer J J (2001) The Tragedy of Great Power Politics New YorkWW Norton amp Company

Moore M and Hutchison P (2010) Yeonpyeong Island A History(2010November 23) The Telegraph httpwwwtelegraphcouknewsworldnews

Detecting patterns in North Korean military provocations 27

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

asiasouthkorea8155486Yeonpyeong-Island-A-historyhtml (7 August 2014date last accessed)

Oh I-W (2011) lsquoSouth Korearsquos countermeasures against North KorearsquosArmed Provocation and Offensive dialogue proposalrsquo Trsquoongiljyollyak 111 227ndash266

Rich T (2012a) lsquoLike father like son Correlates of leaderhip in NorthKorearsquos english language newsrsquo Korea Observer 43 4 649ndash674

Rich T (2012b) lsquoDeciphering North Korearsquos nuclear rhetoric an automatedcontent analysis of KCNA newsrsquo Asian Affairs An American Review 3973ndash89

Sigal L (1998) Disarming Strangers Nuclear Diplomacy with North KoreaPrinceton Princeton University Press

Sohn J-Y (2002) South North Korea clash at sea CNN (2002 June 29)httpeditioncnncom2002WORLDasiapcfeast0629koreawarships (21July 2014 date last accessed)

Sudworth J (2010) How South Korean Ship was Sunk BBC News (2010May 20) httpwwwbbccouknews10130909 (4 August 2014 date lastaccessed)

28 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

  • lcw016-FN1
  • lcw016-FN2
  • lcw016-FN3
  • lcw016-FN4
Page 20: Detecting patterns in North Korean military provocations ...idse.or.kr/file/Whang_1.pdf · that North Korea may have because of increasing power asymmetry since the collapse of the

peace days Will a ratio of 12 13 or an even more skewed ratio gen-erate a better performance than our model using a 710 ratio To an-swer this question we pushed the ratio until we saw a clear sign of themodel becoming too skewed towards the majority class (peace days)In our case it occurred when the ratio between the two classes ex-ceeded 13 On this subject literature on machine-learning clearlyshows that the ratio of crisis and non-crisis should not be significantlyunbalanced According to Ertekin (2013) lsquo[i]n problems where theprevalence of classes is imbalanced it is necessary to prevent the resul-tant model from being skewed towards the majority (negative) classand to ensure that the model is capable of reflecting the true nature ofthe minority (positive) classrsquo Otherwise the generated model can sufferfrom an over-fitting problem For instance in the face of an extremelyskewed dataset (eg the real ratio of our data would be 35 crisis daysvs 6535 peace days) an automated machine-learning process naturallybecomes lsquogreedyrsquo that is in order to increase its overall accuracy it in-creasingly focuses on the larger category (6535 days) while virtually ig-noring the smaller category (35 days) Generating the model in thisway would be problematic however because our object is to focus onthe minority category of crisis days which is less than 05 of all daysfrom 1997 to 2013 After all we are trying to classify upcoming NorthKorean attacks not peaceful days

With this goal in mind we ran robustness checks to see if alternativemodels yield a better explanatory power when we increased the numberof peace days vis-a-vis threat days Because the maximum limit of anunbalanced ratio for automated machine-learning is 13 we attemptedboth 12 and 13 in our tests As Table 2 shows it is clear that the orig-inal model outperforms both alternatives Compared to the 12 ratiothe original model has more explanatory power in every way (iehigher overall accuracy higher threat accuracy higher non-threat accu-racy) By contrast the 13 ratio model has a slightly lower overall accu-racy marginally higher non-threat accuracy but devastatingly lowerthreat accuracy (175) than our original model because the increasingimbalance (from 710 to 13) turned the automated machine-learningprocess to a lsquogreedyrsquo mode As a result it is our conclusion that theoriginal model uses the golden ratio with the most accurate categoriz-ing capacity

20 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Second it is also necessary to check whether an out-of-sample ap-proach is a better alternative to our original model In our analysis wedeveloped a model by using 70 of data from the five North Koreanattacks and then applied it to the remaining 30 of data from Onemay suspect however that a better approach is to elaborate a modelfrom the first three North Korean strikes (ie during the Kim Jong-ilperiod) and then apply it to the remaining two attacks (ie during theKim Jong-un era) in order to see whether it yields more explanatorypower As Table 2 shows the opposite turns out to be the case In factthe out-of-sampling model loses explanatory power in every possibleway (ie lower overall accuracy lower threat accuracy and lower non-threat accuracy) As a result our original model outperforms the out-of-sampling alternative

Third it is necessary to check whether or not there is a leadershipchange effect in North Korea Has the change of power from KimJong-il to Kim Jong-un produced such significant differences in theirleadership style that it may be worth developing two separate models(one for the Kim Jong-il period the other for the Kim Jong-un era)instead of a single model covering the entire period as we did As

Table 2 Comparison with alternative models

ModelsOverall accuracy (N)

Threat accuracy(Sensitivity)

Non-threataccuracy (Specificity)

Rare event I 724 328 823

12 Ratio (368508) (66201) (302367)

Rare event II 788 175 99

13 Ratio (656832) (36206) (620626)

Out of sampling 541 145 821

KJI KJU (6471195) (72495) (575700)

NK leadership KJI 685 (98143) 459 (2861) 793 (6582)

KJI vs KJU KJU 595 (195328) 612 (93152) 58 (102176)

SK leadership Con 614 (221360) 363 (58160) 815 (163200)

Con vs Prog Prog 681 (98144) 493 (3571) 863 (6373)

No cheonan 669 514 787

Case (273408) 92178 181230

Our Model 823 503 961

(401487) (74147) (327340)

Detecting patterns in North Korean military provocations 21

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Table 2 shows the double platform alternative yields a worse outcomeoverall In fact its only advantage is that the Kim-Jong-un-model hasslightly higher threat accuracy (612) than our original model(503) In every other aspect however the Kim-Jong-un-model per-forms much worse Moreover the Kim-Jong-il-model has a much loweraccuracy in all respects than our original model As a result it is clearthat the double platform is a worse alternative to our single platformThe recent leadership change in Pyongyang has shown little effects asfar as its military provocations are concerned

Fourth it is important to investigate whether a leadership change inSouth Korea has any impact on North Korean military provocationsHas the oscillation of power between the lsquoprogressiversquo administrations(Kim Dae-jung and Roh Moo-hyun) and the lsquoconservativersquo regimes(Lee Myung-bak and Park Geun-hye) invoked different responses fromPyongyang If so we should expect a double platform (one model forthe conservative period vs the other model for the progressive era inSouth Korea) to outperform the original single platform As Table 2shows however the two separate models in the double-platform alter-native perform much worse than the original single platform in all cat-egories an overall accuracy threat accuracy and non-threat accuracyClearly the original model is a better choice It is shown above thatthe leadership change in Pyongyang has not produced significant policyshifts in its military provocations Likewise leadership changes inSouth Korea do not seem to have invoked major policy shifts inPyongyang as far as its military provocations are concerned

Finally it is worth examining an alternative model that excludes thesinking of the Cheonan naval ship case Unlike the remaining four at-tacks in our dataset the North Korean government has consistentlydenied its involvement in the Cheonan incident If Pyongyang had notreally sunk the Cheonan as it claimed it means that our original modelis performing below its full potential because it is built upon some er-roneous cases (ie the Cheonan incident) In that case we should ex-pect the no-Cheonan alternative to outperform our original model AsTable 2 shows however when we exclude all the data related to theCheonan case the resulting model loses much explanatory power com-pared to the original model Only in one category (a threat accuracy)the no-Cheonan-case alternative outperforms our original model buteven in that category the difference is ignorable (503 vs 514) In

22 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

all other categories the original model shows much stronger perfor-mances 823 vs 669 in an overall accuracy and 961 vs 787 ina non-threat accuracy Although Pyongyang has denied its involvementin the Cheonan incident the huge performance gap between the origi-nal model and the no-Cheonan alternative suggests otherwise

A different way of comparing performances of alternative models isshown in Figure 6 where the Receiver Operating Characteristic (ROC)is presented A ROC space is defined by a False Positive Rate (1 ndashSpecificity) in the x-axis and a True Positive Rate (Sensitivity) in the y-axis In a ROC space a theoretically perfect model (ie a combinationof 100 True Positive Rate with 0 False Positive Rate) appears in theupper left corner with the coordinates (0 1) In comparison a purelyrandom model such as a coin toss is found along a 45-degree diagonalline As a result good models (ie those performing better than ran-dom guesses) are found above the 45 degree line (especially close to they-axis) whereas bad models are located below it In Figure 6 the

Figure 6 Receiver operating characteristic (ROC) plot

Detecting patterns in North Korean military provocations 23

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

coordinates of our model (004 05) means that it has a False PositiveRate of 4 while its True Positive Rate is 50 By contrast the coordi-nates of alternative models in the ROC space are as follows (i) theRare Event I model with 12 ratio is (018 033) (ii) the Rare Event IImodel with 13 ratio is (001 018) (iii) the Out of Sampling (KJI KJU) model is (018 015) (iv) the Leadership Change (KJI-only)model is (021 046) (v) the Leadership Change (KJU-only) model is(042 061) (vi) the model for Conservative South Korean leadershipis (018 036) (vii) the model for Conservative South Korean leader-ship is (014 049) and (viii) the model that excludes the Cheonan caseis (021 051) As one can see all eight alternatives yield their coordi-nates either close to or lower than the diagonal line in the ROC spacewhereas our model lies above the diagonal line and close to the y-axisin the ROC space As a result we can conclude that the original modelperforms better than its potential rivals

4 Conclusion

In this article we studied articles published by the KCNA the officialnews outlet of North Korea in order to analyze patterns of its conven-tional military provocations To this end we have adopted a newmethod of automated text classification through supervised machinelearning Our model investigated the frequency of all terms appearingin KCNA articles immediately prior to five North Korean military at-tacks between 1997 and 2013 The frequency of these terms was thencompared with the frequency of terms appearing in KCNA articlespublished during peacetime without military provocations The com-parison brought to light a number of key terms ndash lsquoattack wordsrsquo so tospeak ndash whose appearances spiked in the KCNA prior to NorthKorean attacks Based on these terms our model correctly identifieseight of 10 articles as signs of imminent attacks or as peacetime newspieces

Specifically our model found five pattern-detecting terms of NorthKorean military threats lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquo lsquoJunersquo andlsquoJapanesersquo For a proper analysis of their meaning we went back to thearticles in which these five attack words appeared in order to investi-gate the contexts in which they were used by the North Korean gov-ernment Our investigation shows that in the lead-up to an attack

24 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Pyongyang displayed a strong tendency to emphasize the legacy of itsmilitary struggle against lsquoJapanesersquo colonialism and its fight against theUS imperialism during the Korean War (which began in lsquoJunersquo 1950)perhaps in an effort to heighten a domestic patriotic fervor at a timeof impending crisis In addition immediately before Pyongyang em-barks on hostile provocations the KCNA tends to increase its reprint-ing of lsquosignedrsquo commentaries from Rodong Sinmun which typically crit-icizes the United States South Korea and Japan in harsh termsprobably reflecting its deteriorating relations with the outside world

It seems that our methodology is promising to studies of interna-tional conflicts of various sorts First as shown in this article amachine-learning technology is useful in terms of detecting patternsthat distinguish threats of Pyongyang from its lsquonoisesrsquo or lsquobluffingrsquoSecond for other authoritarian regimes where there is a tight govern-ment controls over the mass media our approach can be used to detectpatterns symptoms or clues to escalating crisis Finally even for dem-ocratic countries with a free media our approach can be used on vari-ous occasions For instance a machine-learning technology can be uti-lized to analyze certain aspects of domestic politics such as signalingor communication between policymakers and the public To cite a con-crete example we can use a machine-learning technique to examinehow an aggressive foreign policy like lsquosabre rattlingrsquo affects public per-ceptions of fear or crisis in democracy Also a machine-learning ap-proach can be used to analyze external interactions of a democratic re-gime with other countries For instance we can test if there are anysignificant correlations between presidential elections in the US and thelevel of North Korean threats or between North Korean nuclear testsand varying public reactions in South Korea

As a final note it is not our contention that the model developed inthis article based on an automated machine-learning technique has de-tected intentional signals which Pyongyang sends to the outside worldimmediately prior to its military attack Instead the key attack wordswe have identified should be seen as signs or patterns that the NorthKorean regime unwittingly displays when it is inching toward a militaryoption For two reasons we doubt an intentional or signaling natureof the attack words in our model First if Pyongyang has indeed usedthe five attack words as a deliberate signal to the outside world that itis gearing up for military provocations why would it send the signal in

Detecting patterns in North Korean military provocations 25

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

such an arcane way that can be detected only through a complicatedautomated text classification process If the North Korean regime in-tends to send a signal to the outside world there is a better clearerand more visible option such as an Official Announcement by theMinistry of Foreign Affairs which Pyongyang has occasionally pub-lished in the pages of the KCNA on important issues Second if it hadbeen sending intentional signals of an impending military strike whywould the North Korean regime deny such an attack afterwards If re-peated an ex post denial would only reduce the credibility of an exante signal creating an image of North Korea as a casual bluffer Fortheir arcane nature and occasional ex post denial the five attack wordsin our model should not be treated as a premeditated signal fromPyongyang Instead they should be understood as inadvertent signs orpatterns unconsciously displayed by the North Korean governmentprior to a military attack In fact it is the unplanned nature of theseattack words that provides even more valuable information to the out-side world because it eliminates the possibility of feigned or false sig-nals from North Korea Like a pitcher who unknowingly flinches be-fore he throws a fast ball Pyongyang may unwittingly display certainpatterns before it launches a military strike

Acknowledgements

The publication of this article is supported by the National ResearchFoundation of Korea Grant funded by the Korean Government (NRF-2014S1A5A2A03065042 and NRF-2013S1A3A2055081) Previous ver-sions of this article were presented at the 2014 International StudiesAssociation Annual Convention (Toronto Canada) For questions re-garding the article please contact H Joo

References

Breslow NE (1196) lsquoStatistics in epidemiology the case-control studyrsquoJournal of American Statistical Association 91 433 14ndash28

Cha V (2010) The Sinking of the Cheonan Center for Strategic ampInternational Studies (2010 April 22) httpcsisorgpublicationsinking-cheonan (24 July 2014 date last accessed)

26 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Choe S-H (2009) Korean Navies Skirmish in DisputedWaters New YorkTimes (2009 November 10) httpwwwnytimescom20091111worldasia11koreahtml_rfrac140 (23 June 2014 date last accessed)

Ertekin S (2013) lsquoAdaptive oversampling for imbalanced data classificationrsquoin G Erol amp L Ricard (ed) Information Sciences and Systems pp261ndash269 New York Springer

Global Security (2002) The naval clash on the yellow sea on 29 June 2002 be-tween South and North Korea the situation and ROKrsquos position (2002July 1) Global Security httpwwwglobalsecurityorgwmdlibrarynewsdprk2002dprk-020701-1htm (22 July 2014 date last accessed)

Hong S (2012) lsquoNorth Korearsquos capability to conduct provocations and ROK-US capability to counter themrsquo Korea Association of Defense IndustryStudies 19 2 135ndash136

Hopkins D and King G (2010) lsquoExtracting systemic social science meaningfrom textrsquo American Journal of Political Science 54 1 229ndash47

Jung S-C (2013) Internal Instability and External Provocation Seoul KINU

Jung S-Y (2008) lsquoA study on the origin of the pueblo incident on 1968rsquoKukjejongchiyonrsquogu 11 2 179ndash207

Jurafsky D and Martin J (2009) Speech and Natural Language ProcessingAn Introduction to Natural Language Processing Computational Linguisticsand Speech Recognition NJ Prentice Hall

Kang C-G (2013) lsquoA laboratory for recursive partyrsquo Hangukcontentshakhoe11 4 23ndash32

King G and Zeng L (2001a) lsquoLogistic regression in rare events datarsquoPolitical Analysis 9 2 137ndash163

King G and Zeng L (2001b) lsquoExplaining rare events in InternationalRelationsrsquo International Organization 55 3 693ndash715

Ko M-K (2015) lsquoNorth Korean military adventurism in the late 1960s andthe changes of party-military relationsrsquo Hyondaibukhanyongu 18 3 7ndash58

Lee W-K (2014) lsquoA case study on the provocations by NKrsquo Kunsaji 91 663ndash110

Litwak R (2007) Regime Change Washington DC Johns HopkinsUniversity Press

Macfie N (2013) The Battles of the Korean West Sea (2010 November 29)Reutershttpwwwreuterscomarticle20101129us-korea-north-clashes-idUSTRE6AS1AL20101129 (5 August 2014 date last accessed)

Mearsheimer J J (2001) The Tragedy of Great Power Politics New YorkWW Norton amp Company

Moore M and Hutchison P (2010) Yeonpyeong Island A History(2010November 23) The Telegraph httpwwwtelegraphcouknewsworldnews

Detecting patterns in North Korean military provocations 27

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

asiasouthkorea8155486Yeonpyeong-Island-A-historyhtml (7 August 2014date last accessed)

Oh I-W (2011) lsquoSouth Korearsquos countermeasures against North KorearsquosArmed Provocation and Offensive dialogue proposalrsquo Trsquoongiljyollyak 111 227ndash266

Rich T (2012a) lsquoLike father like son Correlates of leaderhip in NorthKorearsquos english language newsrsquo Korea Observer 43 4 649ndash674

Rich T (2012b) lsquoDeciphering North Korearsquos nuclear rhetoric an automatedcontent analysis of KCNA newsrsquo Asian Affairs An American Review 3973ndash89

Sigal L (1998) Disarming Strangers Nuclear Diplomacy with North KoreaPrinceton Princeton University Press

Sohn J-Y (2002) South North Korea clash at sea CNN (2002 June 29)httpeditioncnncom2002WORLDasiapcfeast0629koreawarships (21July 2014 date last accessed)

Sudworth J (2010) How South Korean Ship was Sunk BBC News (2010May 20) httpwwwbbccouknews10130909 (4 August 2014 date lastaccessed)

28 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

  • lcw016-FN1
  • lcw016-FN2
  • lcw016-FN3
  • lcw016-FN4
Page 21: Detecting patterns in North Korean military provocations ...idse.or.kr/file/Whang_1.pdf · that North Korea may have because of increasing power asymmetry since the collapse of the

Second it is also necessary to check whether an out-of-sample ap-proach is a better alternative to our original model In our analysis wedeveloped a model by using 70 of data from the five North Koreanattacks and then applied it to the remaining 30 of data from Onemay suspect however that a better approach is to elaborate a modelfrom the first three North Korean strikes (ie during the Kim Jong-ilperiod) and then apply it to the remaining two attacks (ie during theKim Jong-un era) in order to see whether it yields more explanatorypower As Table 2 shows the opposite turns out to be the case In factthe out-of-sampling model loses explanatory power in every possibleway (ie lower overall accuracy lower threat accuracy and lower non-threat accuracy) As a result our original model outperforms the out-of-sampling alternative

Third it is necessary to check whether or not there is a leadershipchange effect in North Korea Has the change of power from KimJong-il to Kim Jong-un produced such significant differences in theirleadership style that it may be worth developing two separate models(one for the Kim Jong-il period the other for the Kim Jong-un era)instead of a single model covering the entire period as we did As

Table 2 Comparison with alternative models

ModelsOverall accuracy (N)

Threat accuracy(Sensitivity)

Non-threataccuracy (Specificity)

Rare event I 724 328 823

12 Ratio (368508) (66201) (302367)

Rare event II 788 175 99

13 Ratio (656832) (36206) (620626)

Out of sampling 541 145 821

KJI KJU (6471195) (72495) (575700)

NK leadership KJI 685 (98143) 459 (2861) 793 (6582)

KJI vs KJU KJU 595 (195328) 612 (93152) 58 (102176)

SK leadership Con 614 (221360) 363 (58160) 815 (163200)

Con vs Prog Prog 681 (98144) 493 (3571) 863 (6373)

No cheonan 669 514 787

Case (273408) 92178 181230

Our Model 823 503 961

(401487) (74147) (327340)

Detecting patterns in North Korean military provocations 21

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Table 2 shows the double platform alternative yields a worse outcomeoverall In fact its only advantage is that the Kim-Jong-un-model hasslightly higher threat accuracy (612) than our original model(503) In every other aspect however the Kim-Jong-un-model per-forms much worse Moreover the Kim-Jong-il-model has a much loweraccuracy in all respects than our original model As a result it is clearthat the double platform is a worse alternative to our single platformThe recent leadership change in Pyongyang has shown little effects asfar as its military provocations are concerned

Fourth it is important to investigate whether a leadership change inSouth Korea has any impact on North Korean military provocationsHas the oscillation of power between the lsquoprogressiversquo administrations(Kim Dae-jung and Roh Moo-hyun) and the lsquoconservativersquo regimes(Lee Myung-bak and Park Geun-hye) invoked different responses fromPyongyang If so we should expect a double platform (one model forthe conservative period vs the other model for the progressive era inSouth Korea) to outperform the original single platform As Table 2shows however the two separate models in the double-platform alter-native perform much worse than the original single platform in all cat-egories an overall accuracy threat accuracy and non-threat accuracyClearly the original model is a better choice It is shown above thatthe leadership change in Pyongyang has not produced significant policyshifts in its military provocations Likewise leadership changes inSouth Korea do not seem to have invoked major policy shifts inPyongyang as far as its military provocations are concerned

Finally it is worth examining an alternative model that excludes thesinking of the Cheonan naval ship case Unlike the remaining four at-tacks in our dataset the North Korean government has consistentlydenied its involvement in the Cheonan incident If Pyongyang had notreally sunk the Cheonan as it claimed it means that our original modelis performing below its full potential because it is built upon some er-roneous cases (ie the Cheonan incident) In that case we should ex-pect the no-Cheonan alternative to outperform our original model AsTable 2 shows however when we exclude all the data related to theCheonan case the resulting model loses much explanatory power com-pared to the original model Only in one category (a threat accuracy)the no-Cheonan-case alternative outperforms our original model buteven in that category the difference is ignorable (503 vs 514) In

22 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

all other categories the original model shows much stronger perfor-mances 823 vs 669 in an overall accuracy and 961 vs 787 ina non-threat accuracy Although Pyongyang has denied its involvementin the Cheonan incident the huge performance gap between the origi-nal model and the no-Cheonan alternative suggests otherwise

A different way of comparing performances of alternative models isshown in Figure 6 where the Receiver Operating Characteristic (ROC)is presented A ROC space is defined by a False Positive Rate (1 ndashSpecificity) in the x-axis and a True Positive Rate (Sensitivity) in the y-axis In a ROC space a theoretically perfect model (ie a combinationof 100 True Positive Rate with 0 False Positive Rate) appears in theupper left corner with the coordinates (0 1) In comparison a purelyrandom model such as a coin toss is found along a 45-degree diagonalline As a result good models (ie those performing better than ran-dom guesses) are found above the 45 degree line (especially close to they-axis) whereas bad models are located below it In Figure 6 the

Figure 6 Receiver operating characteristic (ROC) plot

Detecting patterns in North Korean military provocations 23

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

coordinates of our model (004 05) means that it has a False PositiveRate of 4 while its True Positive Rate is 50 By contrast the coordi-nates of alternative models in the ROC space are as follows (i) theRare Event I model with 12 ratio is (018 033) (ii) the Rare Event IImodel with 13 ratio is (001 018) (iii) the Out of Sampling (KJI KJU) model is (018 015) (iv) the Leadership Change (KJI-only)model is (021 046) (v) the Leadership Change (KJU-only) model is(042 061) (vi) the model for Conservative South Korean leadershipis (018 036) (vii) the model for Conservative South Korean leader-ship is (014 049) and (viii) the model that excludes the Cheonan caseis (021 051) As one can see all eight alternatives yield their coordi-nates either close to or lower than the diagonal line in the ROC spacewhereas our model lies above the diagonal line and close to the y-axisin the ROC space As a result we can conclude that the original modelperforms better than its potential rivals

4 Conclusion

In this article we studied articles published by the KCNA the officialnews outlet of North Korea in order to analyze patterns of its conven-tional military provocations To this end we have adopted a newmethod of automated text classification through supervised machinelearning Our model investigated the frequency of all terms appearingin KCNA articles immediately prior to five North Korean military at-tacks between 1997 and 2013 The frequency of these terms was thencompared with the frequency of terms appearing in KCNA articlespublished during peacetime without military provocations The com-parison brought to light a number of key terms ndash lsquoattack wordsrsquo so tospeak ndash whose appearances spiked in the KCNA prior to NorthKorean attacks Based on these terms our model correctly identifieseight of 10 articles as signs of imminent attacks or as peacetime newspieces

Specifically our model found five pattern-detecting terms of NorthKorean military threats lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquo lsquoJunersquo andlsquoJapanesersquo For a proper analysis of their meaning we went back to thearticles in which these five attack words appeared in order to investi-gate the contexts in which they were used by the North Korean gov-ernment Our investigation shows that in the lead-up to an attack

24 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Pyongyang displayed a strong tendency to emphasize the legacy of itsmilitary struggle against lsquoJapanesersquo colonialism and its fight against theUS imperialism during the Korean War (which began in lsquoJunersquo 1950)perhaps in an effort to heighten a domestic patriotic fervor at a timeof impending crisis In addition immediately before Pyongyang em-barks on hostile provocations the KCNA tends to increase its reprint-ing of lsquosignedrsquo commentaries from Rodong Sinmun which typically crit-icizes the United States South Korea and Japan in harsh termsprobably reflecting its deteriorating relations with the outside world

It seems that our methodology is promising to studies of interna-tional conflicts of various sorts First as shown in this article amachine-learning technology is useful in terms of detecting patternsthat distinguish threats of Pyongyang from its lsquonoisesrsquo or lsquobluffingrsquoSecond for other authoritarian regimes where there is a tight govern-ment controls over the mass media our approach can be used to detectpatterns symptoms or clues to escalating crisis Finally even for dem-ocratic countries with a free media our approach can be used on vari-ous occasions For instance a machine-learning technology can be uti-lized to analyze certain aspects of domestic politics such as signalingor communication between policymakers and the public To cite a con-crete example we can use a machine-learning technique to examinehow an aggressive foreign policy like lsquosabre rattlingrsquo affects public per-ceptions of fear or crisis in democracy Also a machine-learning ap-proach can be used to analyze external interactions of a democratic re-gime with other countries For instance we can test if there are anysignificant correlations between presidential elections in the US and thelevel of North Korean threats or between North Korean nuclear testsand varying public reactions in South Korea

As a final note it is not our contention that the model developed inthis article based on an automated machine-learning technique has de-tected intentional signals which Pyongyang sends to the outside worldimmediately prior to its military attack Instead the key attack wordswe have identified should be seen as signs or patterns that the NorthKorean regime unwittingly displays when it is inching toward a militaryoption For two reasons we doubt an intentional or signaling natureof the attack words in our model First if Pyongyang has indeed usedthe five attack words as a deliberate signal to the outside world that itis gearing up for military provocations why would it send the signal in

Detecting patterns in North Korean military provocations 25

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

such an arcane way that can be detected only through a complicatedautomated text classification process If the North Korean regime in-tends to send a signal to the outside world there is a better clearerand more visible option such as an Official Announcement by theMinistry of Foreign Affairs which Pyongyang has occasionally pub-lished in the pages of the KCNA on important issues Second if it hadbeen sending intentional signals of an impending military strike whywould the North Korean regime deny such an attack afterwards If re-peated an ex post denial would only reduce the credibility of an exante signal creating an image of North Korea as a casual bluffer Fortheir arcane nature and occasional ex post denial the five attack wordsin our model should not be treated as a premeditated signal fromPyongyang Instead they should be understood as inadvertent signs orpatterns unconsciously displayed by the North Korean governmentprior to a military attack In fact it is the unplanned nature of theseattack words that provides even more valuable information to the out-side world because it eliminates the possibility of feigned or false sig-nals from North Korea Like a pitcher who unknowingly flinches be-fore he throws a fast ball Pyongyang may unwittingly display certainpatterns before it launches a military strike

Acknowledgements

The publication of this article is supported by the National ResearchFoundation of Korea Grant funded by the Korean Government (NRF-2014S1A5A2A03065042 and NRF-2013S1A3A2055081) Previous ver-sions of this article were presented at the 2014 International StudiesAssociation Annual Convention (Toronto Canada) For questions re-garding the article please contact H Joo

References

Breslow NE (1196) lsquoStatistics in epidemiology the case-control studyrsquoJournal of American Statistical Association 91 433 14ndash28

Cha V (2010) The Sinking of the Cheonan Center for Strategic ampInternational Studies (2010 April 22) httpcsisorgpublicationsinking-cheonan (24 July 2014 date last accessed)

26 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Choe S-H (2009) Korean Navies Skirmish in DisputedWaters New YorkTimes (2009 November 10) httpwwwnytimescom20091111worldasia11koreahtml_rfrac140 (23 June 2014 date last accessed)

Ertekin S (2013) lsquoAdaptive oversampling for imbalanced data classificationrsquoin G Erol amp L Ricard (ed) Information Sciences and Systems pp261ndash269 New York Springer

Global Security (2002) The naval clash on the yellow sea on 29 June 2002 be-tween South and North Korea the situation and ROKrsquos position (2002July 1) Global Security httpwwwglobalsecurityorgwmdlibrarynewsdprk2002dprk-020701-1htm (22 July 2014 date last accessed)

Hong S (2012) lsquoNorth Korearsquos capability to conduct provocations and ROK-US capability to counter themrsquo Korea Association of Defense IndustryStudies 19 2 135ndash136

Hopkins D and King G (2010) lsquoExtracting systemic social science meaningfrom textrsquo American Journal of Political Science 54 1 229ndash47

Jung S-C (2013) Internal Instability and External Provocation Seoul KINU

Jung S-Y (2008) lsquoA study on the origin of the pueblo incident on 1968rsquoKukjejongchiyonrsquogu 11 2 179ndash207

Jurafsky D and Martin J (2009) Speech and Natural Language ProcessingAn Introduction to Natural Language Processing Computational Linguisticsand Speech Recognition NJ Prentice Hall

Kang C-G (2013) lsquoA laboratory for recursive partyrsquo Hangukcontentshakhoe11 4 23ndash32

King G and Zeng L (2001a) lsquoLogistic regression in rare events datarsquoPolitical Analysis 9 2 137ndash163

King G and Zeng L (2001b) lsquoExplaining rare events in InternationalRelationsrsquo International Organization 55 3 693ndash715

Ko M-K (2015) lsquoNorth Korean military adventurism in the late 1960s andthe changes of party-military relationsrsquo Hyondaibukhanyongu 18 3 7ndash58

Lee W-K (2014) lsquoA case study on the provocations by NKrsquo Kunsaji 91 663ndash110

Litwak R (2007) Regime Change Washington DC Johns HopkinsUniversity Press

Macfie N (2013) The Battles of the Korean West Sea (2010 November 29)Reutershttpwwwreuterscomarticle20101129us-korea-north-clashes-idUSTRE6AS1AL20101129 (5 August 2014 date last accessed)

Mearsheimer J J (2001) The Tragedy of Great Power Politics New YorkWW Norton amp Company

Moore M and Hutchison P (2010) Yeonpyeong Island A History(2010November 23) The Telegraph httpwwwtelegraphcouknewsworldnews

Detecting patterns in North Korean military provocations 27

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

asiasouthkorea8155486Yeonpyeong-Island-A-historyhtml (7 August 2014date last accessed)

Oh I-W (2011) lsquoSouth Korearsquos countermeasures against North KorearsquosArmed Provocation and Offensive dialogue proposalrsquo Trsquoongiljyollyak 111 227ndash266

Rich T (2012a) lsquoLike father like son Correlates of leaderhip in NorthKorearsquos english language newsrsquo Korea Observer 43 4 649ndash674

Rich T (2012b) lsquoDeciphering North Korearsquos nuclear rhetoric an automatedcontent analysis of KCNA newsrsquo Asian Affairs An American Review 3973ndash89

Sigal L (1998) Disarming Strangers Nuclear Diplomacy with North KoreaPrinceton Princeton University Press

Sohn J-Y (2002) South North Korea clash at sea CNN (2002 June 29)httpeditioncnncom2002WORLDasiapcfeast0629koreawarships (21July 2014 date last accessed)

Sudworth J (2010) How South Korean Ship was Sunk BBC News (2010May 20) httpwwwbbccouknews10130909 (4 August 2014 date lastaccessed)

28 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

  • lcw016-FN1
  • lcw016-FN2
  • lcw016-FN3
  • lcw016-FN4
Page 22: Detecting patterns in North Korean military provocations ...idse.or.kr/file/Whang_1.pdf · that North Korea may have because of increasing power asymmetry since the collapse of the

Table 2 shows the double platform alternative yields a worse outcomeoverall In fact its only advantage is that the Kim-Jong-un-model hasslightly higher threat accuracy (612) than our original model(503) In every other aspect however the Kim-Jong-un-model per-forms much worse Moreover the Kim-Jong-il-model has a much loweraccuracy in all respects than our original model As a result it is clearthat the double platform is a worse alternative to our single platformThe recent leadership change in Pyongyang has shown little effects asfar as its military provocations are concerned

Fourth it is important to investigate whether a leadership change inSouth Korea has any impact on North Korean military provocationsHas the oscillation of power between the lsquoprogressiversquo administrations(Kim Dae-jung and Roh Moo-hyun) and the lsquoconservativersquo regimes(Lee Myung-bak and Park Geun-hye) invoked different responses fromPyongyang If so we should expect a double platform (one model forthe conservative period vs the other model for the progressive era inSouth Korea) to outperform the original single platform As Table 2shows however the two separate models in the double-platform alter-native perform much worse than the original single platform in all cat-egories an overall accuracy threat accuracy and non-threat accuracyClearly the original model is a better choice It is shown above thatthe leadership change in Pyongyang has not produced significant policyshifts in its military provocations Likewise leadership changes inSouth Korea do not seem to have invoked major policy shifts inPyongyang as far as its military provocations are concerned

Finally it is worth examining an alternative model that excludes thesinking of the Cheonan naval ship case Unlike the remaining four at-tacks in our dataset the North Korean government has consistentlydenied its involvement in the Cheonan incident If Pyongyang had notreally sunk the Cheonan as it claimed it means that our original modelis performing below its full potential because it is built upon some er-roneous cases (ie the Cheonan incident) In that case we should ex-pect the no-Cheonan alternative to outperform our original model AsTable 2 shows however when we exclude all the data related to theCheonan case the resulting model loses much explanatory power com-pared to the original model Only in one category (a threat accuracy)the no-Cheonan-case alternative outperforms our original model buteven in that category the difference is ignorable (503 vs 514) In

22 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

all other categories the original model shows much stronger perfor-mances 823 vs 669 in an overall accuracy and 961 vs 787 ina non-threat accuracy Although Pyongyang has denied its involvementin the Cheonan incident the huge performance gap between the origi-nal model and the no-Cheonan alternative suggests otherwise

A different way of comparing performances of alternative models isshown in Figure 6 where the Receiver Operating Characteristic (ROC)is presented A ROC space is defined by a False Positive Rate (1 ndashSpecificity) in the x-axis and a True Positive Rate (Sensitivity) in the y-axis In a ROC space a theoretically perfect model (ie a combinationof 100 True Positive Rate with 0 False Positive Rate) appears in theupper left corner with the coordinates (0 1) In comparison a purelyrandom model such as a coin toss is found along a 45-degree diagonalline As a result good models (ie those performing better than ran-dom guesses) are found above the 45 degree line (especially close to they-axis) whereas bad models are located below it In Figure 6 the

Figure 6 Receiver operating characteristic (ROC) plot

Detecting patterns in North Korean military provocations 23

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

coordinates of our model (004 05) means that it has a False PositiveRate of 4 while its True Positive Rate is 50 By contrast the coordi-nates of alternative models in the ROC space are as follows (i) theRare Event I model with 12 ratio is (018 033) (ii) the Rare Event IImodel with 13 ratio is (001 018) (iii) the Out of Sampling (KJI KJU) model is (018 015) (iv) the Leadership Change (KJI-only)model is (021 046) (v) the Leadership Change (KJU-only) model is(042 061) (vi) the model for Conservative South Korean leadershipis (018 036) (vii) the model for Conservative South Korean leader-ship is (014 049) and (viii) the model that excludes the Cheonan caseis (021 051) As one can see all eight alternatives yield their coordi-nates either close to or lower than the diagonal line in the ROC spacewhereas our model lies above the diagonal line and close to the y-axisin the ROC space As a result we can conclude that the original modelperforms better than its potential rivals

4 Conclusion

In this article we studied articles published by the KCNA the officialnews outlet of North Korea in order to analyze patterns of its conven-tional military provocations To this end we have adopted a newmethod of automated text classification through supervised machinelearning Our model investigated the frequency of all terms appearingin KCNA articles immediately prior to five North Korean military at-tacks between 1997 and 2013 The frequency of these terms was thencompared with the frequency of terms appearing in KCNA articlespublished during peacetime without military provocations The com-parison brought to light a number of key terms ndash lsquoattack wordsrsquo so tospeak ndash whose appearances spiked in the KCNA prior to NorthKorean attacks Based on these terms our model correctly identifieseight of 10 articles as signs of imminent attacks or as peacetime newspieces

Specifically our model found five pattern-detecting terms of NorthKorean military threats lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquo lsquoJunersquo andlsquoJapanesersquo For a proper analysis of their meaning we went back to thearticles in which these five attack words appeared in order to investi-gate the contexts in which they were used by the North Korean gov-ernment Our investigation shows that in the lead-up to an attack

24 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Pyongyang displayed a strong tendency to emphasize the legacy of itsmilitary struggle against lsquoJapanesersquo colonialism and its fight against theUS imperialism during the Korean War (which began in lsquoJunersquo 1950)perhaps in an effort to heighten a domestic patriotic fervor at a timeof impending crisis In addition immediately before Pyongyang em-barks on hostile provocations the KCNA tends to increase its reprint-ing of lsquosignedrsquo commentaries from Rodong Sinmun which typically crit-icizes the United States South Korea and Japan in harsh termsprobably reflecting its deteriorating relations with the outside world

It seems that our methodology is promising to studies of interna-tional conflicts of various sorts First as shown in this article amachine-learning technology is useful in terms of detecting patternsthat distinguish threats of Pyongyang from its lsquonoisesrsquo or lsquobluffingrsquoSecond for other authoritarian regimes where there is a tight govern-ment controls over the mass media our approach can be used to detectpatterns symptoms or clues to escalating crisis Finally even for dem-ocratic countries with a free media our approach can be used on vari-ous occasions For instance a machine-learning technology can be uti-lized to analyze certain aspects of domestic politics such as signalingor communication between policymakers and the public To cite a con-crete example we can use a machine-learning technique to examinehow an aggressive foreign policy like lsquosabre rattlingrsquo affects public per-ceptions of fear or crisis in democracy Also a machine-learning ap-proach can be used to analyze external interactions of a democratic re-gime with other countries For instance we can test if there are anysignificant correlations between presidential elections in the US and thelevel of North Korean threats or between North Korean nuclear testsand varying public reactions in South Korea

As a final note it is not our contention that the model developed inthis article based on an automated machine-learning technique has de-tected intentional signals which Pyongyang sends to the outside worldimmediately prior to its military attack Instead the key attack wordswe have identified should be seen as signs or patterns that the NorthKorean regime unwittingly displays when it is inching toward a militaryoption For two reasons we doubt an intentional or signaling natureof the attack words in our model First if Pyongyang has indeed usedthe five attack words as a deliberate signal to the outside world that itis gearing up for military provocations why would it send the signal in

Detecting patterns in North Korean military provocations 25

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

such an arcane way that can be detected only through a complicatedautomated text classification process If the North Korean regime in-tends to send a signal to the outside world there is a better clearerand more visible option such as an Official Announcement by theMinistry of Foreign Affairs which Pyongyang has occasionally pub-lished in the pages of the KCNA on important issues Second if it hadbeen sending intentional signals of an impending military strike whywould the North Korean regime deny such an attack afterwards If re-peated an ex post denial would only reduce the credibility of an exante signal creating an image of North Korea as a casual bluffer Fortheir arcane nature and occasional ex post denial the five attack wordsin our model should not be treated as a premeditated signal fromPyongyang Instead they should be understood as inadvertent signs orpatterns unconsciously displayed by the North Korean governmentprior to a military attack In fact it is the unplanned nature of theseattack words that provides even more valuable information to the out-side world because it eliminates the possibility of feigned or false sig-nals from North Korea Like a pitcher who unknowingly flinches be-fore he throws a fast ball Pyongyang may unwittingly display certainpatterns before it launches a military strike

Acknowledgements

The publication of this article is supported by the National ResearchFoundation of Korea Grant funded by the Korean Government (NRF-2014S1A5A2A03065042 and NRF-2013S1A3A2055081) Previous ver-sions of this article were presented at the 2014 International StudiesAssociation Annual Convention (Toronto Canada) For questions re-garding the article please contact H Joo

References

Breslow NE (1196) lsquoStatistics in epidemiology the case-control studyrsquoJournal of American Statistical Association 91 433 14ndash28

Cha V (2010) The Sinking of the Cheonan Center for Strategic ampInternational Studies (2010 April 22) httpcsisorgpublicationsinking-cheonan (24 July 2014 date last accessed)

26 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Choe S-H (2009) Korean Navies Skirmish in DisputedWaters New YorkTimes (2009 November 10) httpwwwnytimescom20091111worldasia11koreahtml_rfrac140 (23 June 2014 date last accessed)

Ertekin S (2013) lsquoAdaptive oversampling for imbalanced data classificationrsquoin G Erol amp L Ricard (ed) Information Sciences and Systems pp261ndash269 New York Springer

Global Security (2002) The naval clash on the yellow sea on 29 June 2002 be-tween South and North Korea the situation and ROKrsquos position (2002July 1) Global Security httpwwwglobalsecurityorgwmdlibrarynewsdprk2002dprk-020701-1htm (22 July 2014 date last accessed)

Hong S (2012) lsquoNorth Korearsquos capability to conduct provocations and ROK-US capability to counter themrsquo Korea Association of Defense IndustryStudies 19 2 135ndash136

Hopkins D and King G (2010) lsquoExtracting systemic social science meaningfrom textrsquo American Journal of Political Science 54 1 229ndash47

Jung S-C (2013) Internal Instability and External Provocation Seoul KINU

Jung S-Y (2008) lsquoA study on the origin of the pueblo incident on 1968rsquoKukjejongchiyonrsquogu 11 2 179ndash207

Jurafsky D and Martin J (2009) Speech and Natural Language ProcessingAn Introduction to Natural Language Processing Computational Linguisticsand Speech Recognition NJ Prentice Hall

Kang C-G (2013) lsquoA laboratory for recursive partyrsquo Hangukcontentshakhoe11 4 23ndash32

King G and Zeng L (2001a) lsquoLogistic regression in rare events datarsquoPolitical Analysis 9 2 137ndash163

King G and Zeng L (2001b) lsquoExplaining rare events in InternationalRelationsrsquo International Organization 55 3 693ndash715

Ko M-K (2015) lsquoNorth Korean military adventurism in the late 1960s andthe changes of party-military relationsrsquo Hyondaibukhanyongu 18 3 7ndash58

Lee W-K (2014) lsquoA case study on the provocations by NKrsquo Kunsaji 91 663ndash110

Litwak R (2007) Regime Change Washington DC Johns HopkinsUniversity Press

Macfie N (2013) The Battles of the Korean West Sea (2010 November 29)Reutershttpwwwreuterscomarticle20101129us-korea-north-clashes-idUSTRE6AS1AL20101129 (5 August 2014 date last accessed)

Mearsheimer J J (2001) The Tragedy of Great Power Politics New YorkWW Norton amp Company

Moore M and Hutchison P (2010) Yeonpyeong Island A History(2010November 23) The Telegraph httpwwwtelegraphcouknewsworldnews

Detecting patterns in North Korean military provocations 27

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

asiasouthkorea8155486Yeonpyeong-Island-A-historyhtml (7 August 2014date last accessed)

Oh I-W (2011) lsquoSouth Korearsquos countermeasures against North KorearsquosArmed Provocation and Offensive dialogue proposalrsquo Trsquoongiljyollyak 111 227ndash266

Rich T (2012a) lsquoLike father like son Correlates of leaderhip in NorthKorearsquos english language newsrsquo Korea Observer 43 4 649ndash674

Rich T (2012b) lsquoDeciphering North Korearsquos nuclear rhetoric an automatedcontent analysis of KCNA newsrsquo Asian Affairs An American Review 3973ndash89

Sigal L (1998) Disarming Strangers Nuclear Diplomacy with North KoreaPrinceton Princeton University Press

Sohn J-Y (2002) South North Korea clash at sea CNN (2002 June 29)httpeditioncnncom2002WORLDasiapcfeast0629koreawarships (21July 2014 date last accessed)

Sudworth J (2010) How South Korean Ship was Sunk BBC News (2010May 20) httpwwwbbccouknews10130909 (4 August 2014 date lastaccessed)

28 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

  • lcw016-FN1
  • lcw016-FN2
  • lcw016-FN3
  • lcw016-FN4
Page 23: Detecting patterns in North Korean military provocations ...idse.or.kr/file/Whang_1.pdf · that North Korea may have because of increasing power asymmetry since the collapse of the

all other categories the original model shows much stronger perfor-mances 823 vs 669 in an overall accuracy and 961 vs 787 ina non-threat accuracy Although Pyongyang has denied its involvementin the Cheonan incident the huge performance gap between the origi-nal model and the no-Cheonan alternative suggests otherwise

A different way of comparing performances of alternative models isshown in Figure 6 where the Receiver Operating Characteristic (ROC)is presented A ROC space is defined by a False Positive Rate (1 ndashSpecificity) in the x-axis and a True Positive Rate (Sensitivity) in the y-axis In a ROC space a theoretically perfect model (ie a combinationof 100 True Positive Rate with 0 False Positive Rate) appears in theupper left corner with the coordinates (0 1) In comparison a purelyrandom model such as a coin toss is found along a 45-degree diagonalline As a result good models (ie those performing better than ran-dom guesses) are found above the 45 degree line (especially close to they-axis) whereas bad models are located below it In Figure 6 the

Figure 6 Receiver operating characteristic (ROC) plot

Detecting patterns in North Korean military provocations 23

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

coordinates of our model (004 05) means that it has a False PositiveRate of 4 while its True Positive Rate is 50 By contrast the coordi-nates of alternative models in the ROC space are as follows (i) theRare Event I model with 12 ratio is (018 033) (ii) the Rare Event IImodel with 13 ratio is (001 018) (iii) the Out of Sampling (KJI KJU) model is (018 015) (iv) the Leadership Change (KJI-only)model is (021 046) (v) the Leadership Change (KJU-only) model is(042 061) (vi) the model for Conservative South Korean leadershipis (018 036) (vii) the model for Conservative South Korean leader-ship is (014 049) and (viii) the model that excludes the Cheonan caseis (021 051) As one can see all eight alternatives yield their coordi-nates either close to or lower than the diagonal line in the ROC spacewhereas our model lies above the diagonal line and close to the y-axisin the ROC space As a result we can conclude that the original modelperforms better than its potential rivals

4 Conclusion

In this article we studied articles published by the KCNA the officialnews outlet of North Korea in order to analyze patterns of its conven-tional military provocations To this end we have adopted a newmethod of automated text classification through supervised machinelearning Our model investigated the frequency of all terms appearingin KCNA articles immediately prior to five North Korean military at-tacks between 1997 and 2013 The frequency of these terms was thencompared with the frequency of terms appearing in KCNA articlespublished during peacetime without military provocations The com-parison brought to light a number of key terms ndash lsquoattack wordsrsquo so tospeak ndash whose appearances spiked in the KCNA prior to NorthKorean attacks Based on these terms our model correctly identifieseight of 10 articles as signs of imminent attacks or as peacetime newspieces

Specifically our model found five pattern-detecting terms of NorthKorean military threats lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquo lsquoJunersquo andlsquoJapanesersquo For a proper analysis of their meaning we went back to thearticles in which these five attack words appeared in order to investi-gate the contexts in which they were used by the North Korean gov-ernment Our investigation shows that in the lead-up to an attack

24 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Pyongyang displayed a strong tendency to emphasize the legacy of itsmilitary struggle against lsquoJapanesersquo colonialism and its fight against theUS imperialism during the Korean War (which began in lsquoJunersquo 1950)perhaps in an effort to heighten a domestic patriotic fervor at a timeof impending crisis In addition immediately before Pyongyang em-barks on hostile provocations the KCNA tends to increase its reprint-ing of lsquosignedrsquo commentaries from Rodong Sinmun which typically crit-icizes the United States South Korea and Japan in harsh termsprobably reflecting its deteriorating relations with the outside world

It seems that our methodology is promising to studies of interna-tional conflicts of various sorts First as shown in this article amachine-learning technology is useful in terms of detecting patternsthat distinguish threats of Pyongyang from its lsquonoisesrsquo or lsquobluffingrsquoSecond for other authoritarian regimes where there is a tight govern-ment controls over the mass media our approach can be used to detectpatterns symptoms or clues to escalating crisis Finally even for dem-ocratic countries with a free media our approach can be used on vari-ous occasions For instance a machine-learning technology can be uti-lized to analyze certain aspects of domestic politics such as signalingor communication between policymakers and the public To cite a con-crete example we can use a machine-learning technique to examinehow an aggressive foreign policy like lsquosabre rattlingrsquo affects public per-ceptions of fear or crisis in democracy Also a machine-learning ap-proach can be used to analyze external interactions of a democratic re-gime with other countries For instance we can test if there are anysignificant correlations between presidential elections in the US and thelevel of North Korean threats or between North Korean nuclear testsand varying public reactions in South Korea

As a final note it is not our contention that the model developed inthis article based on an automated machine-learning technique has de-tected intentional signals which Pyongyang sends to the outside worldimmediately prior to its military attack Instead the key attack wordswe have identified should be seen as signs or patterns that the NorthKorean regime unwittingly displays when it is inching toward a militaryoption For two reasons we doubt an intentional or signaling natureof the attack words in our model First if Pyongyang has indeed usedthe five attack words as a deliberate signal to the outside world that itis gearing up for military provocations why would it send the signal in

Detecting patterns in North Korean military provocations 25

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

such an arcane way that can be detected only through a complicatedautomated text classification process If the North Korean regime in-tends to send a signal to the outside world there is a better clearerand more visible option such as an Official Announcement by theMinistry of Foreign Affairs which Pyongyang has occasionally pub-lished in the pages of the KCNA on important issues Second if it hadbeen sending intentional signals of an impending military strike whywould the North Korean regime deny such an attack afterwards If re-peated an ex post denial would only reduce the credibility of an exante signal creating an image of North Korea as a casual bluffer Fortheir arcane nature and occasional ex post denial the five attack wordsin our model should not be treated as a premeditated signal fromPyongyang Instead they should be understood as inadvertent signs orpatterns unconsciously displayed by the North Korean governmentprior to a military attack In fact it is the unplanned nature of theseattack words that provides even more valuable information to the out-side world because it eliminates the possibility of feigned or false sig-nals from North Korea Like a pitcher who unknowingly flinches be-fore he throws a fast ball Pyongyang may unwittingly display certainpatterns before it launches a military strike

Acknowledgements

The publication of this article is supported by the National ResearchFoundation of Korea Grant funded by the Korean Government (NRF-2014S1A5A2A03065042 and NRF-2013S1A3A2055081) Previous ver-sions of this article were presented at the 2014 International StudiesAssociation Annual Convention (Toronto Canada) For questions re-garding the article please contact H Joo

References

Breslow NE (1196) lsquoStatistics in epidemiology the case-control studyrsquoJournal of American Statistical Association 91 433 14ndash28

Cha V (2010) The Sinking of the Cheonan Center for Strategic ampInternational Studies (2010 April 22) httpcsisorgpublicationsinking-cheonan (24 July 2014 date last accessed)

26 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Choe S-H (2009) Korean Navies Skirmish in DisputedWaters New YorkTimes (2009 November 10) httpwwwnytimescom20091111worldasia11koreahtml_rfrac140 (23 June 2014 date last accessed)

Ertekin S (2013) lsquoAdaptive oversampling for imbalanced data classificationrsquoin G Erol amp L Ricard (ed) Information Sciences and Systems pp261ndash269 New York Springer

Global Security (2002) The naval clash on the yellow sea on 29 June 2002 be-tween South and North Korea the situation and ROKrsquos position (2002July 1) Global Security httpwwwglobalsecurityorgwmdlibrarynewsdprk2002dprk-020701-1htm (22 July 2014 date last accessed)

Hong S (2012) lsquoNorth Korearsquos capability to conduct provocations and ROK-US capability to counter themrsquo Korea Association of Defense IndustryStudies 19 2 135ndash136

Hopkins D and King G (2010) lsquoExtracting systemic social science meaningfrom textrsquo American Journal of Political Science 54 1 229ndash47

Jung S-C (2013) Internal Instability and External Provocation Seoul KINU

Jung S-Y (2008) lsquoA study on the origin of the pueblo incident on 1968rsquoKukjejongchiyonrsquogu 11 2 179ndash207

Jurafsky D and Martin J (2009) Speech and Natural Language ProcessingAn Introduction to Natural Language Processing Computational Linguisticsand Speech Recognition NJ Prentice Hall

Kang C-G (2013) lsquoA laboratory for recursive partyrsquo Hangukcontentshakhoe11 4 23ndash32

King G and Zeng L (2001a) lsquoLogistic regression in rare events datarsquoPolitical Analysis 9 2 137ndash163

King G and Zeng L (2001b) lsquoExplaining rare events in InternationalRelationsrsquo International Organization 55 3 693ndash715

Ko M-K (2015) lsquoNorth Korean military adventurism in the late 1960s andthe changes of party-military relationsrsquo Hyondaibukhanyongu 18 3 7ndash58

Lee W-K (2014) lsquoA case study on the provocations by NKrsquo Kunsaji 91 663ndash110

Litwak R (2007) Regime Change Washington DC Johns HopkinsUniversity Press

Macfie N (2013) The Battles of the Korean West Sea (2010 November 29)Reutershttpwwwreuterscomarticle20101129us-korea-north-clashes-idUSTRE6AS1AL20101129 (5 August 2014 date last accessed)

Mearsheimer J J (2001) The Tragedy of Great Power Politics New YorkWW Norton amp Company

Moore M and Hutchison P (2010) Yeonpyeong Island A History(2010November 23) The Telegraph httpwwwtelegraphcouknewsworldnews

Detecting patterns in North Korean military provocations 27

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

asiasouthkorea8155486Yeonpyeong-Island-A-historyhtml (7 August 2014date last accessed)

Oh I-W (2011) lsquoSouth Korearsquos countermeasures against North KorearsquosArmed Provocation and Offensive dialogue proposalrsquo Trsquoongiljyollyak 111 227ndash266

Rich T (2012a) lsquoLike father like son Correlates of leaderhip in NorthKorearsquos english language newsrsquo Korea Observer 43 4 649ndash674

Rich T (2012b) lsquoDeciphering North Korearsquos nuclear rhetoric an automatedcontent analysis of KCNA newsrsquo Asian Affairs An American Review 3973ndash89

Sigal L (1998) Disarming Strangers Nuclear Diplomacy with North KoreaPrinceton Princeton University Press

Sohn J-Y (2002) South North Korea clash at sea CNN (2002 June 29)httpeditioncnncom2002WORLDasiapcfeast0629koreawarships (21July 2014 date last accessed)

Sudworth J (2010) How South Korean Ship was Sunk BBC News (2010May 20) httpwwwbbccouknews10130909 (4 August 2014 date lastaccessed)

28 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

  • lcw016-FN1
  • lcw016-FN2
  • lcw016-FN3
  • lcw016-FN4
Page 24: Detecting patterns in North Korean military provocations ...idse.or.kr/file/Whang_1.pdf · that North Korea may have because of increasing power asymmetry since the collapse of the

coordinates of our model (004 05) means that it has a False PositiveRate of 4 while its True Positive Rate is 50 By contrast the coordi-nates of alternative models in the ROC space are as follows (i) theRare Event I model with 12 ratio is (018 033) (ii) the Rare Event IImodel with 13 ratio is (001 018) (iii) the Out of Sampling (KJI KJU) model is (018 015) (iv) the Leadership Change (KJI-only)model is (021 046) (v) the Leadership Change (KJU-only) model is(042 061) (vi) the model for Conservative South Korean leadershipis (018 036) (vii) the model for Conservative South Korean leader-ship is (014 049) and (viii) the model that excludes the Cheonan caseis (021 051) As one can see all eight alternatives yield their coordi-nates either close to or lower than the diagonal line in the ROC spacewhereas our model lies above the diagonal line and close to the y-axisin the ROC space As a result we can conclude that the original modelperforms better than its potential rivals

4 Conclusion

In this article we studied articles published by the KCNA the officialnews outlet of North Korea in order to analyze patterns of its conven-tional military provocations To this end we have adopted a newmethod of automated text classification through supervised machinelearning Our model investigated the frequency of all terms appearingin KCNA articles immediately prior to five North Korean military at-tacks between 1997 and 2013 The frequency of these terms was thencompared with the frequency of terms appearing in KCNA articlespublished during peacetime without military provocations The com-parison brought to light a number of key terms ndash lsquoattack wordsrsquo so tospeak ndash whose appearances spiked in the KCNA prior to NorthKorean attacks Based on these terms our model correctly identifieseight of 10 articles as signs of imminent attacks or as peacetime newspieces

Specifically our model found five pattern-detecting terms of NorthKorean military threats lsquoyearsrsquo lsquosignedrsquo lsquoassemblyrsquo lsquoJunersquo andlsquoJapanesersquo For a proper analysis of their meaning we went back to thearticles in which these five attack words appeared in order to investi-gate the contexts in which they were used by the North Korean gov-ernment Our investigation shows that in the lead-up to an attack

24 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Pyongyang displayed a strong tendency to emphasize the legacy of itsmilitary struggle against lsquoJapanesersquo colonialism and its fight against theUS imperialism during the Korean War (which began in lsquoJunersquo 1950)perhaps in an effort to heighten a domestic patriotic fervor at a timeof impending crisis In addition immediately before Pyongyang em-barks on hostile provocations the KCNA tends to increase its reprint-ing of lsquosignedrsquo commentaries from Rodong Sinmun which typically crit-icizes the United States South Korea and Japan in harsh termsprobably reflecting its deteriorating relations with the outside world

It seems that our methodology is promising to studies of interna-tional conflicts of various sorts First as shown in this article amachine-learning technology is useful in terms of detecting patternsthat distinguish threats of Pyongyang from its lsquonoisesrsquo or lsquobluffingrsquoSecond for other authoritarian regimes where there is a tight govern-ment controls over the mass media our approach can be used to detectpatterns symptoms or clues to escalating crisis Finally even for dem-ocratic countries with a free media our approach can be used on vari-ous occasions For instance a machine-learning technology can be uti-lized to analyze certain aspects of domestic politics such as signalingor communication between policymakers and the public To cite a con-crete example we can use a machine-learning technique to examinehow an aggressive foreign policy like lsquosabre rattlingrsquo affects public per-ceptions of fear or crisis in democracy Also a machine-learning ap-proach can be used to analyze external interactions of a democratic re-gime with other countries For instance we can test if there are anysignificant correlations between presidential elections in the US and thelevel of North Korean threats or between North Korean nuclear testsand varying public reactions in South Korea

As a final note it is not our contention that the model developed inthis article based on an automated machine-learning technique has de-tected intentional signals which Pyongyang sends to the outside worldimmediately prior to its military attack Instead the key attack wordswe have identified should be seen as signs or patterns that the NorthKorean regime unwittingly displays when it is inching toward a militaryoption For two reasons we doubt an intentional or signaling natureof the attack words in our model First if Pyongyang has indeed usedthe five attack words as a deliberate signal to the outside world that itis gearing up for military provocations why would it send the signal in

Detecting patterns in North Korean military provocations 25

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

such an arcane way that can be detected only through a complicatedautomated text classification process If the North Korean regime in-tends to send a signal to the outside world there is a better clearerand more visible option such as an Official Announcement by theMinistry of Foreign Affairs which Pyongyang has occasionally pub-lished in the pages of the KCNA on important issues Second if it hadbeen sending intentional signals of an impending military strike whywould the North Korean regime deny such an attack afterwards If re-peated an ex post denial would only reduce the credibility of an exante signal creating an image of North Korea as a casual bluffer Fortheir arcane nature and occasional ex post denial the five attack wordsin our model should not be treated as a premeditated signal fromPyongyang Instead they should be understood as inadvertent signs orpatterns unconsciously displayed by the North Korean governmentprior to a military attack In fact it is the unplanned nature of theseattack words that provides even more valuable information to the out-side world because it eliminates the possibility of feigned or false sig-nals from North Korea Like a pitcher who unknowingly flinches be-fore he throws a fast ball Pyongyang may unwittingly display certainpatterns before it launches a military strike

Acknowledgements

The publication of this article is supported by the National ResearchFoundation of Korea Grant funded by the Korean Government (NRF-2014S1A5A2A03065042 and NRF-2013S1A3A2055081) Previous ver-sions of this article were presented at the 2014 International StudiesAssociation Annual Convention (Toronto Canada) For questions re-garding the article please contact H Joo

References

Breslow NE (1196) lsquoStatistics in epidemiology the case-control studyrsquoJournal of American Statistical Association 91 433 14ndash28

Cha V (2010) The Sinking of the Cheonan Center for Strategic ampInternational Studies (2010 April 22) httpcsisorgpublicationsinking-cheonan (24 July 2014 date last accessed)

26 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Choe S-H (2009) Korean Navies Skirmish in DisputedWaters New YorkTimes (2009 November 10) httpwwwnytimescom20091111worldasia11koreahtml_rfrac140 (23 June 2014 date last accessed)

Ertekin S (2013) lsquoAdaptive oversampling for imbalanced data classificationrsquoin G Erol amp L Ricard (ed) Information Sciences and Systems pp261ndash269 New York Springer

Global Security (2002) The naval clash on the yellow sea on 29 June 2002 be-tween South and North Korea the situation and ROKrsquos position (2002July 1) Global Security httpwwwglobalsecurityorgwmdlibrarynewsdprk2002dprk-020701-1htm (22 July 2014 date last accessed)

Hong S (2012) lsquoNorth Korearsquos capability to conduct provocations and ROK-US capability to counter themrsquo Korea Association of Defense IndustryStudies 19 2 135ndash136

Hopkins D and King G (2010) lsquoExtracting systemic social science meaningfrom textrsquo American Journal of Political Science 54 1 229ndash47

Jung S-C (2013) Internal Instability and External Provocation Seoul KINU

Jung S-Y (2008) lsquoA study on the origin of the pueblo incident on 1968rsquoKukjejongchiyonrsquogu 11 2 179ndash207

Jurafsky D and Martin J (2009) Speech and Natural Language ProcessingAn Introduction to Natural Language Processing Computational Linguisticsand Speech Recognition NJ Prentice Hall

Kang C-G (2013) lsquoA laboratory for recursive partyrsquo Hangukcontentshakhoe11 4 23ndash32

King G and Zeng L (2001a) lsquoLogistic regression in rare events datarsquoPolitical Analysis 9 2 137ndash163

King G and Zeng L (2001b) lsquoExplaining rare events in InternationalRelationsrsquo International Organization 55 3 693ndash715

Ko M-K (2015) lsquoNorth Korean military adventurism in the late 1960s andthe changes of party-military relationsrsquo Hyondaibukhanyongu 18 3 7ndash58

Lee W-K (2014) lsquoA case study on the provocations by NKrsquo Kunsaji 91 663ndash110

Litwak R (2007) Regime Change Washington DC Johns HopkinsUniversity Press

Macfie N (2013) The Battles of the Korean West Sea (2010 November 29)Reutershttpwwwreuterscomarticle20101129us-korea-north-clashes-idUSTRE6AS1AL20101129 (5 August 2014 date last accessed)

Mearsheimer J J (2001) The Tragedy of Great Power Politics New YorkWW Norton amp Company

Moore M and Hutchison P (2010) Yeonpyeong Island A History(2010November 23) The Telegraph httpwwwtelegraphcouknewsworldnews

Detecting patterns in North Korean military provocations 27

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

asiasouthkorea8155486Yeonpyeong-Island-A-historyhtml (7 August 2014date last accessed)

Oh I-W (2011) lsquoSouth Korearsquos countermeasures against North KorearsquosArmed Provocation and Offensive dialogue proposalrsquo Trsquoongiljyollyak 111 227ndash266

Rich T (2012a) lsquoLike father like son Correlates of leaderhip in NorthKorearsquos english language newsrsquo Korea Observer 43 4 649ndash674

Rich T (2012b) lsquoDeciphering North Korearsquos nuclear rhetoric an automatedcontent analysis of KCNA newsrsquo Asian Affairs An American Review 3973ndash89

Sigal L (1998) Disarming Strangers Nuclear Diplomacy with North KoreaPrinceton Princeton University Press

Sohn J-Y (2002) South North Korea clash at sea CNN (2002 June 29)httpeditioncnncom2002WORLDasiapcfeast0629koreawarships (21July 2014 date last accessed)

Sudworth J (2010) How South Korean Ship was Sunk BBC News (2010May 20) httpwwwbbccouknews10130909 (4 August 2014 date lastaccessed)

28 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

  • lcw016-FN1
  • lcw016-FN2
  • lcw016-FN3
  • lcw016-FN4
Page 25: Detecting patterns in North Korean military provocations ...idse.or.kr/file/Whang_1.pdf · that North Korea may have because of increasing power asymmetry since the collapse of the

Pyongyang displayed a strong tendency to emphasize the legacy of itsmilitary struggle against lsquoJapanesersquo colonialism and its fight against theUS imperialism during the Korean War (which began in lsquoJunersquo 1950)perhaps in an effort to heighten a domestic patriotic fervor at a timeof impending crisis In addition immediately before Pyongyang em-barks on hostile provocations the KCNA tends to increase its reprint-ing of lsquosignedrsquo commentaries from Rodong Sinmun which typically crit-icizes the United States South Korea and Japan in harsh termsprobably reflecting its deteriorating relations with the outside world

It seems that our methodology is promising to studies of interna-tional conflicts of various sorts First as shown in this article amachine-learning technology is useful in terms of detecting patternsthat distinguish threats of Pyongyang from its lsquonoisesrsquo or lsquobluffingrsquoSecond for other authoritarian regimes where there is a tight govern-ment controls over the mass media our approach can be used to detectpatterns symptoms or clues to escalating crisis Finally even for dem-ocratic countries with a free media our approach can be used on vari-ous occasions For instance a machine-learning technology can be uti-lized to analyze certain aspects of domestic politics such as signalingor communication between policymakers and the public To cite a con-crete example we can use a machine-learning technique to examinehow an aggressive foreign policy like lsquosabre rattlingrsquo affects public per-ceptions of fear or crisis in democracy Also a machine-learning ap-proach can be used to analyze external interactions of a democratic re-gime with other countries For instance we can test if there are anysignificant correlations between presidential elections in the US and thelevel of North Korean threats or between North Korean nuclear testsand varying public reactions in South Korea

As a final note it is not our contention that the model developed inthis article based on an automated machine-learning technique has de-tected intentional signals which Pyongyang sends to the outside worldimmediately prior to its military attack Instead the key attack wordswe have identified should be seen as signs or patterns that the NorthKorean regime unwittingly displays when it is inching toward a militaryoption For two reasons we doubt an intentional or signaling natureof the attack words in our model First if Pyongyang has indeed usedthe five attack words as a deliberate signal to the outside world that itis gearing up for military provocations why would it send the signal in

Detecting patterns in North Korean military provocations 25

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

such an arcane way that can be detected only through a complicatedautomated text classification process If the North Korean regime in-tends to send a signal to the outside world there is a better clearerand more visible option such as an Official Announcement by theMinistry of Foreign Affairs which Pyongyang has occasionally pub-lished in the pages of the KCNA on important issues Second if it hadbeen sending intentional signals of an impending military strike whywould the North Korean regime deny such an attack afterwards If re-peated an ex post denial would only reduce the credibility of an exante signal creating an image of North Korea as a casual bluffer Fortheir arcane nature and occasional ex post denial the five attack wordsin our model should not be treated as a premeditated signal fromPyongyang Instead they should be understood as inadvertent signs orpatterns unconsciously displayed by the North Korean governmentprior to a military attack In fact it is the unplanned nature of theseattack words that provides even more valuable information to the out-side world because it eliminates the possibility of feigned or false sig-nals from North Korea Like a pitcher who unknowingly flinches be-fore he throws a fast ball Pyongyang may unwittingly display certainpatterns before it launches a military strike

Acknowledgements

The publication of this article is supported by the National ResearchFoundation of Korea Grant funded by the Korean Government (NRF-2014S1A5A2A03065042 and NRF-2013S1A3A2055081) Previous ver-sions of this article were presented at the 2014 International StudiesAssociation Annual Convention (Toronto Canada) For questions re-garding the article please contact H Joo

References

Breslow NE (1196) lsquoStatistics in epidemiology the case-control studyrsquoJournal of American Statistical Association 91 433 14ndash28

Cha V (2010) The Sinking of the Cheonan Center for Strategic ampInternational Studies (2010 April 22) httpcsisorgpublicationsinking-cheonan (24 July 2014 date last accessed)

26 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Choe S-H (2009) Korean Navies Skirmish in DisputedWaters New YorkTimes (2009 November 10) httpwwwnytimescom20091111worldasia11koreahtml_rfrac140 (23 June 2014 date last accessed)

Ertekin S (2013) lsquoAdaptive oversampling for imbalanced data classificationrsquoin G Erol amp L Ricard (ed) Information Sciences and Systems pp261ndash269 New York Springer

Global Security (2002) The naval clash on the yellow sea on 29 June 2002 be-tween South and North Korea the situation and ROKrsquos position (2002July 1) Global Security httpwwwglobalsecurityorgwmdlibrarynewsdprk2002dprk-020701-1htm (22 July 2014 date last accessed)

Hong S (2012) lsquoNorth Korearsquos capability to conduct provocations and ROK-US capability to counter themrsquo Korea Association of Defense IndustryStudies 19 2 135ndash136

Hopkins D and King G (2010) lsquoExtracting systemic social science meaningfrom textrsquo American Journal of Political Science 54 1 229ndash47

Jung S-C (2013) Internal Instability and External Provocation Seoul KINU

Jung S-Y (2008) lsquoA study on the origin of the pueblo incident on 1968rsquoKukjejongchiyonrsquogu 11 2 179ndash207

Jurafsky D and Martin J (2009) Speech and Natural Language ProcessingAn Introduction to Natural Language Processing Computational Linguisticsand Speech Recognition NJ Prentice Hall

Kang C-G (2013) lsquoA laboratory for recursive partyrsquo Hangukcontentshakhoe11 4 23ndash32

King G and Zeng L (2001a) lsquoLogistic regression in rare events datarsquoPolitical Analysis 9 2 137ndash163

King G and Zeng L (2001b) lsquoExplaining rare events in InternationalRelationsrsquo International Organization 55 3 693ndash715

Ko M-K (2015) lsquoNorth Korean military adventurism in the late 1960s andthe changes of party-military relationsrsquo Hyondaibukhanyongu 18 3 7ndash58

Lee W-K (2014) lsquoA case study on the provocations by NKrsquo Kunsaji 91 663ndash110

Litwak R (2007) Regime Change Washington DC Johns HopkinsUniversity Press

Macfie N (2013) The Battles of the Korean West Sea (2010 November 29)Reutershttpwwwreuterscomarticle20101129us-korea-north-clashes-idUSTRE6AS1AL20101129 (5 August 2014 date last accessed)

Mearsheimer J J (2001) The Tragedy of Great Power Politics New YorkWW Norton amp Company

Moore M and Hutchison P (2010) Yeonpyeong Island A History(2010November 23) The Telegraph httpwwwtelegraphcouknewsworldnews

Detecting patterns in North Korean military provocations 27

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

asiasouthkorea8155486Yeonpyeong-Island-A-historyhtml (7 August 2014date last accessed)

Oh I-W (2011) lsquoSouth Korearsquos countermeasures against North KorearsquosArmed Provocation and Offensive dialogue proposalrsquo Trsquoongiljyollyak 111 227ndash266

Rich T (2012a) lsquoLike father like son Correlates of leaderhip in NorthKorearsquos english language newsrsquo Korea Observer 43 4 649ndash674

Rich T (2012b) lsquoDeciphering North Korearsquos nuclear rhetoric an automatedcontent analysis of KCNA newsrsquo Asian Affairs An American Review 3973ndash89

Sigal L (1998) Disarming Strangers Nuclear Diplomacy with North KoreaPrinceton Princeton University Press

Sohn J-Y (2002) South North Korea clash at sea CNN (2002 June 29)httpeditioncnncom2002WORLDasiapcfeast0629koreawarships (21July 2014 date last accessed)

Sudworth J (2010) How South Korean Ship was Sunk BBC News (2010May 20) httpwwwbbccouknews10130909 (4 August 2014 date lastaccessed)

28 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

  • lcw016-FN1
  • lcw016-FN2
  • lcw016-FN3
  • lcw016-FN4
Page 26: Detecting patterns in North Korean military provocations ...idse.or.kr/file/Whang_1.pdf · that North Korea may have because of increasing power asymmetry since the collapse of the

such an arcane way that can be detected only through a complicatedautomated text classification process If the North Korean regime in-tends to send a signal to the outside world there is a better clearerand more visible option such as an Official Announcement by theMinistry of Foreign Affairs which Pyongyang has occasionally pub-lished in the pages of the KCNA on important issues Second if it hadbeen sending intentional signals of an impending military strike whywould the North Korean regime deny such an attack afterwards If re-peated an ex post denial would only reduce the credibility of an exante signal creating an image of North Korea as a casual bluffer Fortheir arcane nature and occasional ex post denial the five attack wordsin our model should not be treated as a premeditated signal fromPyongyang Instead they should be understood as inadvertent signs orpatterns unconsciously displayed by the North Korean governmentprior to a military attack In fact it is the unplanned nature of theseattack words that provides even more valuable information to the out-side world because it eliminates the possibility of feigned or false sig-nals from North Korea Like a pitcher who unknowingly flinches be-fore he throws a fast ball Pyongyang may unwittingly display certainpatterns before it launches a military strike

Acknowledgements

The publication of this article is supported by the National ResearchFoundation of Korea Grant funded by the Korean Government (NRF-2014S1A5A2A03065042 and NRF-2013S1A3A2055081) Previous ver-sions of this article were presented at the 2014 International StudiesAssociation Annual Convention (Toronto Canada) For questions re-garding the article please contact H Joo

References

Breslow NE (1196) lsquoStatistics in epidemiology the case-control studyrsquoJournal of American Statistical Association 91 433 14ndash28

Cha V (2010) The Sinking of the Cheonan Center for Strategic ampInternational Studies (2010 April 22) httpcsisorgpublicationsinking-cheonan (24 July 2014 date last accessed)

26 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

Choe S-H (2009) Korean Navies Skirmish in DisputedWaters New YorkTimes (2009 November 10) httpwwwnytimescom20091111worldasia11koreahtml_rfrac140 (23 June 2014 date last accessed)

Ertekin S (2013) lsquoAdaptive oversampling for imbalanced data classificationrsquoin G Erol amp L Ricard (ed) Information Sciences and Systems pp261ndash269 New York Springer

Global Security (2002) The naval clash on the yellow sea on 29 June 2002 be-tween South and North Korea the situation and ROKrsquos position (2002July 1) Global Security httpwwwglobalsecurityorgwmdlibrarynewsdprk2002dprk-020701-1htm (22 July 2014 date last accessed)

Hong S (2012) lsquoNorth Korearsquos capability to conduct provocations and ROK-US capability to counter themrsquo Korea Association of Defense IndustryStudies 19 2 135ndash136

Hopkins D and King G (2010) lsquoExtracting systemic social science meaningfrom textrsquo American Journal of Political Science 54 1 229ndash47

Jung S-C (2013) Internal Instability and External Provocation Seoul KINU

Jung S-Y (2008) lsquoA study on the origin of the pueblo incident on 1968rsquoKukjejongchiyonrsquogu 11 2 179ndash207

Jurafsky D and Martin J (2009) Speech and Natural Language ProcessingAn Introduction to Natural Language Processing Computational Linguisticsand Speech Recognition NJ Prentice Hall

Kang C-G (2013) lsquoA laboratory for recursive partyrsquo Hangukcontentshakhoe11 4 23ndash32

King G and Zeng L (2001a) lsquoLogistic regression in rare events datarsquoPolitical Analysis 9 2 137ndash163

King G and Zeng L (2001b) lsquoExplaining rare events in InternationalRelationsrsquo International Organization 55 3 693ndash715

Ko M-K (2015) lsquoNorth Korean military adventurism in the late 1960s andthe changes of party-military relationsrsquo Hyondaibukhanyongu 18 3 7ndash58

Lee W-K (2014) lsquoA case study on the provocations by NKrsquo Kunsaji 91 663ndash110

Litwak R (2007) Regime Change Washington DC Johns HopkinsUniversity Press

Macfie N (2013) The Battles of the Korean West Sea (2010 November 29)Reutershttpwwwreuterscomarticle20101129us-korea-north-clashes-idUSTRE6AS1AL20101129 (5 August 2014 date last accessed)

Mearsheimer J J (2001) The Tragedy of Great Power Politics New YorkWW Norton amp Company

Moore M and Hutchison P (2010) Yeonpyeong Island A History(2010November 23) The Telegraph httpwwwtelegraphcouknewsworldnews

Detecting patterns in North Korean military provocations 27

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

asiasouthkorea8155486Yeonpyeong-Island-A-historyhtml (7 August 2014date last accessed)

Oh I-W (2011) lsquoSouth Korearsquos countermeasures against North KorearsquosArmed Provocation and Offensive dialogue proposalrsquo Trsquoongiljyollyak 111 227ndash266

Rich T (2012a) lsquoLike father like son Correlates of leaderhip in NorthKorearsquos english language newsrsquo Korea Observer 43 4 649ndash674

Rich T (2012b) lsquoDeciphering North Korearsquos nuclear rhetoric an automatedcontent analysis of KCNA newsrsquo Asian Affairs An American Review 3973ndash89

Sigal L (1998) Disarming Strangers Nuclear Diplomacy with North KoreaPrinceton Princeton University Press

Sohn J-Y (2002) South North Korea clash at sea CNN (2002 June 29)httpeditioncnncom2002WORLDasiapcfeast0629koreawarships (21July 2014 date last accessed)

Sudworth J (2010) How South Korean Ship was Sunk BBC News (2010May 20) httpwwwbbccouknews10130909 (4 August 2014 date lastaccessed)

28 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

  • lcw016-FN1
  • lcw016-FN2
  • lcw016-FN3
  • lcw016-FN4
Page 27: Detecting patterns in North Korean military provocations ...idse.or.kr/file/Whang_1.pdf · that North Korea may have because of increasing power asymmetry since the collapse of the

Choe S-H (2009) Korean Navies Skirmish in DisputedWaters New YorkTimes (2009 November 10) httpwwwnytimescom20091111worldasia11koreahtml_rfrac140 (23 June 2014 date last accessed)

Ertekin S (2013) lsquoAdaptive oversampling for imbalanced data classificationrsquoin G Erol amp L Ricard (ed) Information Sciences and Systems pp261ndash269 New York Springer

Global Security (2002) The naval clash on the yellow sea on 29 June 2002 be-tween South and North Korea the situation and ROKrsquos position (2002July 1) Global Security httpwwwglobalsecurityorgwmdlibrarynewsdprk2002dprk-020701-1htm (22 July 2014 date last accessed)

Hong S (2012) lsquoNorth Korearsquos capability to conduct provocations and ROK-US capability to counter themrsquo Korea Association of Defense IndustryStudies 19 2 135ndash136

Hopkins D and King G (2010) lsquoExtracting systemic social science meaningfrom textrsquo American Journal of Political Science 54 1 229ndash47

Jung S-C (2013) Internal Instability and External Provocation Seoul KINU

Jung S-Y (2008) lsquoA study on the origin of the pueblo incident on 1968rsquoKukjejongchiyonrsquogu 11 2 179ndash207

Jurafsky D and Martin J (2009) Speech and Natural Language ProcessingAn Introduction to Natural Language Processing Computational Linguisticsand Speech Recognition NJ Prentice Hall

Kang C-G (2013) lsquoA laboratory for recursive partyrsquo Hangukcontentshakhoe11 4 23ndash32

King G and Zeng L (2001a) lsquoLogistic regression in rare events datarsquoPolitical Analysis 9 2 137ndash163

King G and Zeng L (2001b) lsquoExplaining rare events in InternationalRelationsrsquo International Organization 55 3 693ndash715

Ko M-K (2015) lsquoNorth Korean military adventurism in the late 1960s andthe changes of party-military relationsrsquo Hyondaibukhanyongu 18 3 7ndash58

Lee W-K (2014) lsquoA case study on the provocations by NKrsquo Kunsaji 91 663ndash110

Litwak R (2007) Regime Change Washington DC Johns HopkinsUniversity Press

Macfie N (2013) The Battles of the Korean West Sea (2010 November 29)Reutershttpwwwreuterscomarticle20101129us-korea-north-clashes-idUSTRE6AS1AL20101129 (5 August 2014 date last accessed)

Mearsheimer J J (2001) The Tragedy of Great Power Politics New YorkWW Norton amp Company

Moore M and Hutchison P (2010) Yeonpyeong Island A History(2010November 23) The Telegraph httpwwwtelegraphcouknewsworldnews

Detecting patterns in North Korean military provocations 27

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

asiasouthkorea8155486Yeonpyeong-Island-A-historyhtml (7 August 2014date last accessed)

Oh I-W (2011) lsquoSouth Korearsquos countermeasures against North KorearsquosArmed Provocation and Offensive dialogue proposalrsquo Trsquoongiljyollyak 111 227ndash266

Rich T (2012a) lsquoLike father like son Correlates of leaderhip in NorthKorearsquos english language newsrsquo Korea Observer 43 4 649ndash674

Rich T (2012b) lsquoDeciphering North Korearsquos nuclear rhetoric an automatedcontent analysis of KCNA newsrsquo Asian Affairs An American Review 3973ndash89

Sigal L (1998) Disarming Strangers Nuclear Diplomacy with North KoreaPrinceton Princeton University Press

Sohn J-Y (2002) South North Korea clash at sea CNN (2002 June 29)httpeditioncnncom2002WORLDasiapcfeast0629koreawarships (21July 2014 date last accessed)

Sudworth J (2010) How South Korean Ship was Sunk BBC News (2010May 20) httpwwwbbccouknews10130909 (4 August 2014 date lastaccessed)

28 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

  • lcw016-FN1
  • lcw016-FN2
  • lcw016-FN3
  • lcw016-FN4
Page 28: Detecting patterns in North Korean military provocations ...idse.or.kr/file/Whang_1.pdf · that North Korea may have because of increasing power asymmetry since the collapse of the

asiasouthkorea8155486Yeonpyeong-Island-A-historyhtml (7 August 2014date last accessed)

Oh I-W (2011) lsquoSouth Korearsquos countermeasures against North KorearsquosArmed Provocation and Offensive dialogue proposalrsquo Trsquoongiljyollyak 111 227ndash266

Rich T (2012a) lsquoLike father like son Correlates of leaderhip in NorthKorearsquos english language newsrsquo Korea Observer 43 4 649ndash674

Rich T (2012b) lsquoDeciphering North Korearsquos nuclear rhetoric an automatedcontent analysis of KCNA newsrsquo Asian Affairs An American Review 3973ndash89

Sigal L (1998) Disarming Strangers Nuclear Diplomacy with North KoreaPrinceton Princeton University Press

Sohn J-Y (2002) South North Korea clash at sea CNN (2002 June 29)httpeditioncnncom2002WORLDasiapcfeast0629koreawarships (21July 2014 date last accessed)

Sudworth J (2010) How South Korean Ship was Sunk BBC News (2010May 20) httpwwwbbccouknews10130909 (4 August 2014 date lastaccessed)

28 Whang Lammbrau and Joo

by guest on October 23 2016

httpirapoxfordjournalsorgD

ownloaded from

  • lcw016-FN1
  • lcw016-FN2
  • lcw016-FN3
  • lcw016-FN4