Searching for Patterns in Political Event Sequences: Experiments with the KEDS Database


Cybernetics and Systems: An International Journal

Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/ucbs20

Published online: 29 Oct 2010.

To cite this article: Klaus Kovar, Johannes Fürnkranz, Johann Petrak, Bernhard Pfahringer, Robert Trappl, Gerhard Widmer (2000) Searching for Patterns in Political Event Sequences: Experiments with the KEDS Database, Cybernetics and Systems: An International Journal, 31:6, 649-668, DOI: 10.1080/01969720050143184


SEARCHING FOR PATTERNS IN POLITICAL EVENT SEQUENCES: EXPERIMENTS WITH THE KEDS DATABASE

KLAUS KOVAR, JOHANNES FÜRNKRANZ, JOHANN PETRAK, BERNHARD PFAHRINGER

Austrian Research Institute for Artificial Intelligence, Vienna, Austria

ROBERT TRAPPL, GERHARD WIDMER

Department of Medical Cybernetics and Artificial Intelligence, University of Vienna, Vienna, Austria

This paper presents an empirical study on the possibility of discovering interesting event sequences and sequential rules in a large database of international political events. A data mining algorithm first presented by Mannila and Toivonen (1996), which is able to search for generalized episodes in such event databases, has been implemented and extended. Experiments conducted with this algorithm on the Kansas Event Data System (KEDS) database, an event data set covering interactions between countries in the Persian Gulf region, are described. Some qualitative and quantitative results are reported, and experiences with strategies for reducing the problem complexity and focusing the search on interesting subsets of events are described.

MOTIVATION

Recent years have witnessed an increasing interest in the potential of data mining methods not only in business and technical applications, but also in the social sciences. As more and more empirical data are collected, the development of computer methods for searching for interesting and possibly predictive patterns in large databases becomes important.

This research was sponsored by the Austrian National Bank (OeNB) under grant 7295. Financial support for the Austrian Research Institute for Artificial Intelligence is provided by the Austrian Federal Ministry for Science and Transport (BMWV).

Address correspondence to OFAI, Schottengasse 3, 1010 Wien, Austria.

In the political sciences, in particular, a number of databases have been established that document various aspects of international politics and its history, e.g., the CONFMAN (Bercovitch and Langley, 1993), SHERFACS (Sherman, 1988), and KOSIMO (Pfetsch and Billing, 1994) databases, to name but a few. In previous investigations, it has been shown that data mining methods can indeed find predictive patterns (mostly classification rules) in such databases (see, e.g., Fürnkranz et al. (1997) and Trappl et al. (1996)).

The present study continues this line of research. Its purpose is to investigate whether interesting temporal patterns and dependencies can be discovered in large temporally ordered datasets. Knowledge of such temporal dependencies or action patterns might be useful, e.g., in early warning settings. An international event database with rather simple structure, but with a very large number of entries (more than 300,000), was selected. From a technical point of view, there was a particular interest in the complexity aspects introduced by the sheer size of such a database, and in possible ways of reducing that complexity. To that end, a well-known algorithm that searches for temporal dependencies was reimplemented, and extended with some facilities that allow the user to explicitly define the search space and thus influence the problem complexity. Some illustrative experimental results will be presented and various strategies for making such an approach computationally feasible will be discussed.

THE KEDS DATABASE

The KEDS database (Gerner and Schrodt, 1996) is a WEIS-coded event data set that describes bilateral political events in the Arabian/Persian Gulf region (which includes countries like Iran, Iraq, Kuwait, Oman, Saudi Arabia, Yemen, and the smaller Gulf states, but also Israel, its Arab neighbors, and various major international organizations). The coded events cover political and military interactions between these parties from the period April 1979 to March 1999. The data set was generated automatically from Reuters news reports downloaded from the NEXIS data service. The data was coded with the Kansas Event Data System (KEDS) machine-coding program into the very simple and abstract general form indicated in Figure 1.

IRN or IRQ are abbreviations for countries involved in interactions (Iran and Iraq in this case). The first country (the field named source code in Figure 1) is the "active" country, the actor who caused the event. The second (target code) is the "passive" one. The event codes, which describe the actions between the two countries (or international organizations), come from a set of 110 predefined actions used in this database. For details on KEDS see Schrodt and Gerner (1998). The data set can be downloaded from the KEDS website http://www.cc.ukans.edu/~keds/

In this experiment, two versions of the database were used. The first version consists only of events coded from the lead sentences of the Reuters reports and has 57,131 entries. The second version contains events derived from the full news agency articles and comprises 304,402 entries. Both databases cover the time from 15 April 1979 to 31 March 1999.

Figure 1. Structure of entries in the KEDS database.

SEARCHING FOR TEMPORAL DEPENDENCIES IN THE KEDS DATABASE

As mentioned in the introduction, the goal of this research was to investigate whether interesting temporal event patterns, possibly useful in an early warning setting, could be extracted from this database with the help of data mining methods. The patterns will take the form of temporal dependency rules (or episode rules). The algorithm that has been implemented for this purpose is based on the algorithm by Mannila and Toivonen (1996). Similar algorithms have been proposed by Srikant and Agrawal (1995, 1996).

The Basic Algorithm

A number of basic concepts have to be defined before the algorithm can be properly presented. The following definitions are based on Mannila and Toivonen (1996).

An event sequence is an ordered sequence of events. Each event is described by several attributes and an associated time of occurrence. Extensionally, an event corresponds to a row in the data set. In this database an event is described by the attributes Actor1 (field source code), Actor2 (target code), and Action (event code). The date field specifies the time of occurrence.

An episode is a combination of events with a partially specified order; episodes are the basic concepts from which event prediction rules will be derived. They are represented as conjunctive patterns of conditions over attributes. Figure 2 gives an example of a (fictitious) episode.

This episode describes situations where Israel asks the U.S. for aid, and the U.S. (later) gives some assistance. x and y are event variables that represent individual events and implicitly specify the temporal order of the episode. Ask-for-aid and assistance are predicates on actions that are true if an action belongs to a particular (user-defined) class. Generally, a predicate is a condition that can be true in one event or across two events. Two kinds of predicates are considered, unary (which apply to one event) and binary ones (which apply to one or two events). An episode occurs in an event sequence if there is a (not necessarily contiguous) sequence of events that satisfy the conditions in the episode. The set of occurrences thus constitutes the extension of the episode.

Figure 2. Intensional definition of an episode.
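The definitions above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the class `Event`, the helper `occurs`, the integer day numbers, and the action codes `REQ` and `AID` are hypothetical placeholders (they are not real WEIS event codes and not part of the system described here). An episode is represented as a list of unary conditions that must be satisfied by events in the given order, with arbitrary gaps in between.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    day: int      # time of occurrence (a plain day number, for simplicity)
    actor1: str   # source code: the active party
    actor2: str   # target code: the passive party
    action: str   # event code


# Hypothetical user-defined action classes (placeholder codes, not WEIS codes).
ASK_FOR_AID = {"REQ"}
ASSISTANCE = {"AID"}


def x_cond(e):
    # Condition on event x: Israel asks the U.S. for aid.
    return e.actor1 == "ISR" and e.actor2 == "USA" and e.action in ASK_FOR_AID


def y_cond(e):
    # Condition on event y: the U.S. gives assistance to Israel.
    return e.actor1 == "USA" and e.actor2 == "ISR" and e.action in ASSISTANCE


def occurs(sequence, conditions):
    """True if events satisfying the conditions appear in this order.

    The events need not be contiguous; the shared iterator guarantees that
    each condition is matched strictly after the previous one.
    """
    remaining = iter(sequence)
    return all(any(cond(e) for e in remaining) for cond in conditions)


events = [
    Event(1, "ISR", "USA", "REQ"),
    Event(2, "IRN", "IRQ", "223"),
    Event(5, "USA", "ISR", "AID"),
]
episode = [x_cond, y_cond]
found = occurs(events, episode)
found_reversed = occurs(list(reversed(events)), episode)
```

In this toy sequence the episode occurs, whereas it does not occur in the reversed sequence, because the assistance would then precede the request.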


An occurrence of an episode P is minimal if P does not occur at any proper subinterval. The set of minimal occurrences of an episode P is denoted by mo(P). The size of an episode is the number of predicates that are true in the episode. An episode is simple if it contains no binary predicates. And finally, an episode is called frequent if the number of its minimal occurrences |mo(P)| is larger than a given frequency threshold min-freq.

The algorithm consists of two parts. It first finds all frequent episodes in the data, and then generates rules from these that satisfy certain confidence conditions.

Finding frequent episodes. The following algorithm, which finds all frequent episodes for a given frequency threshold, is taken directly from Mannila and Toivonen (1996).

The basic algorithm is very similar to the Apriori algorithm introduced in Agrawal and Srikant (1994). In the first step one looks for episodes of size 1 in the database, i.e., episodes that include exactly one predicate. In each following iteration the algorithm selects two "joinable" episodes, creates a candidate episode with an incremented size from these, and computes the minimal occurrences of this new episode from the minimal occurrences of the two subepisodes. If the frequency of the resulting candidate episode is greater than or equal to the user-defined min-freq, the episode is retained. The issue of selecting the two subepisodes, P1 and P2, in line 9 is not entirely trivial; Mannila and Toivonen (1996) discuss this in a bit more detail. The loop continues until no more episodes with a frequency greater than or equal to the user-defined min-freq can be found.

Algorithm 1: Discovery of frequent simple serial episodes (Mannila and Toivonen, 1996)

Input:  event sequence S
        unary predicates G
        binary predicates D
        frequency threshold min-freq

Output: all frequent simple episodes from the set of possible episodes ES(G, D) (and their minimal occurrences)

 1. C_1 := the set of episodes of size 1 in ES(G, D);
 2. for all P ∈ C_1 compute mo(P);
 3. L_1 := {P ∈ C_1 | P is frequent};
 4. i := 1;
 5. while L_i ≠ ∅ do
 6.     C_{i+1} := {P | P ∈ ES(G, D), subepisodes of P are in L_i};
 7.     i++;
 8.     for all P ∈ C_i do
 9.         select subepisodes P1 and P2;
10.         compute mo(P) from mo(P1) and mo(P2);
11.     end;
12.     L_i := {P ∈ C_i | P is frequent};
13. end;
14. for all i and all P ∈ L_i do
15.     output P and mo(P);
16. end;
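The level-wise structure of Algorithm 1 can be sketched in a few lines of Python. This is a simplified illustration, not the implementation used in this study: events are reduced to a single type label instead of predicates, the minimal occurrences of each candidate are recomputed by rescanning the sequence rather than joined from mo(P1) and mo(P2), and the names `latest_start`, `minimal_occurrences`, `frequent_serial_episodes`, `max_width`, and `max_size` are assumptions introduced here. Events are assumed to be sorted by time, and their position in the list is taken as their order.

```python
def latest_start(seq, episode, j):
    """Index of the latest start of an occurrence of `episode` that ends
    exactly at index j, or None if no occurrence ends there."""
    if seq[j][1] != episode[-1]:
        return None
    i = j
    # Match the remaining event types right to left, as late as possible.
    for typ in reversed(episode[:-1]):
        i -= 1
        while i >= 0 and seq[i][1] != typ:
            i -= 1
        if i < 0:
            return None
    return i


def minimal_occurrences(seq, episode, max_width):
    """Minimal occurrences (start_time, end_time) of a serial episode.

    seq is a list of (time, event_type) pairs sorted by time; occurrences
    wider than max_width time units are discarded.
    """
    found, prev_start = [], None
    for j in range(len(seq)):
        s = latest_start(seq, episode, j)
        # Same start as an earlier end: a shorter occurrence already exists.
        if s is None or s == prev_start:
            continue
        prev_start = s
        if seq[j][0] - seq[s][0] <= max_width:
            found.append((seq[s][0], seq[j][0]))
    return found


def frequent_serial_episodes(seq, min_freq, max_width, max_size):
    """Level-wise search: candidates of size i+1 are built from frequent
    episodes of size i and kept if they have at least min_freq minimal
    occurrences."""
    def frequent_among(candidates):
        level = {}
        for cand in candidates:
            mo = minimal_occurrences(seq, cand, max_width)
            if len(mo) >= min_freq:
                level[cand] = mo
        return level

    level = frequent_among({(typ,) for _, typ in seq})
    frequent, size = dict(level), 1
    while level and size < max_size:
        # Join p and q when p without its first element equals q without its last.
        candidates = {p + (q[-1],) for p in level for q in level if p[1:] == q[:-1]}
        level = frequent_among(candidates)
        frequent.update(level)
        size += 1
    return frequent


seq = [(1, "A"), (2, "B"), (3, "A"), (4, "B"), (5, "C")]
result = frequent_serial_episodes(seq, min_freq=2, max_width=7, max_size=3)
abbc = [(1, "A"), (2, "B"), (3, "B"), (4, "C")]
```

Because this sketch rescans the sequence, it finds the occurrence of A,B,C in the sequence A,B,B,C directly; a join-based computation of minimal occurrences, as in line 10 of Algorithm 1, has to handle that case separately.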

For this application, some minor changes to the original algorithm have been made (mainly concerning binary predicates and search schemata), but the basic structure has remained unchanged.

Generating prediction rules. Extracting rules from the frequent episodes found is straightforward. An episode rule is an expression P[V] ⇒ Q[W], where P and Q are frequent, and V and W are real numbers indicating the maximum time that may pass between the start of the "antecedent episode" P and the end of P, and between the start of P and the end of the "consequent episode" Q, respectively. Interest is in all rules with a given minimum confidence, where the confidence is computed essentially as the ratio of the number of (minimal) occurrences of P followed by Q in the time window start(P) + W versus the number of minimal occurrences of P within start(P) + V. If the resulting value is at least as large as a user-specified minimal confidence min-conf, the rule is considered valid.

To illustrate, Figure 3 shows an episode rule. Here, the interpretation is that if Israel asks the U.S. for some kind of support and the U.S. promises support within 7 days, the support is given within another 7 days.

In the remainder of the paper, the short-hand notation that is displayed in the lower part of Figure 3 will be used.
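The confidence computation can be illustrated with a short sketch. It rests on one reading of the definition above, which is an assumption: the occurrences of the consequent are given as intervals of the complete episode (antecedent plus consequent), and an antecedent occurrence counts as a hit if such an interval starts no earlier than the antecedent and ends no later than start(P) + W. The function name `rule_confidence` and the interval lists are hypothetical.

```python
def rule_confidence(mo_p, mo_full, v, w):
    """Confidence of an episode rule P[V] => Q[W].

    mo_p:    minimal occurrences (start, end) of the antecedent P
    mo_full: minimal occurrences (start, end) of the complete episode
    v, w:    maximum widths of the antecedent and of the whole rule
    """
    # Only antecedent occurrences that fit into the window V are considered.
    antecedents = [(s, e) for s, e in mo_p if e - s <= v]
    if not antecedents:
        return 0.0
    hits = sum(
        1
        for s, _ in antecedents
        if any(s <= fs and fe <= s + w for fs, fe in mo_full)
    )
    return hits / len(antecedents)


# Toy intervals in days (made up for illustration).
mo_p = [(1, 2), (10, 11), (20, 30)]
mo_full = [(1, 5), (10, 25)]
conf = rule_confidence(mo_p, mo_full, v=7, w=14)
```

In the toy data, two antecedent occurrences fit into the window V = 7, and only the first is completed within W = 14 days, so the confidence is 0.5. A rule would be kept if this value is at least min-conf.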


Figure 3. An episode rule in long and short-hand notation.

Defining the Search Space

To allow the user to easily customize the representation language of the system and to guide and constrain the search in a problem-dependent way, two quite straightforward extensions to the basic algorithm have been devised.

Unary and binary predicates. The first concerns the possibility for the user to define a hierarchy of predicates. Generally, a predicate is a condition on the attributes of an event, or a constraint on attributes of two different events. The former type is referred to as unary predicates. Unary predicates either test for equality with a constant, as in

    Actor1 = USA

or they test whether an attribute value belongs to a predefined class. That means that the user will be able to create his or her own taxonomy of predicates. For example, if the user has made the definition

    arabic-country(X):
    SYR
    JOR
    IRQ
    ...

then arabic-country(x.Actor1) will be true in an event x if the value of its attribute Actor1 occurs in the definition of arabic-country. The user can specify a hierarchy of predicates.

Analogously, binary predicates allow testing for equality, e.g.,

    equal(x.Actor1,x.Actor2)

and again testing for the occurrence in a predefined set of constants like

    neighbors(x.Actor1,x.Actor2),

where the relation neighbors might be defined as:

    neighbors(X,Y):
    IRQ, IRN
    ISR, JOR
    ISR, SYR
    ...

Readers familiar with the first-order inductive learning (FOIL) algorithm (Quinlan, 1990) will recognize the similarity between FOIL input specifications and the chosen format.
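As a rough illustration of how such user-defined predicates behave, the definitions above can be mirrored with plain Python sets. This is a sketch only: the sets contain just the entries listed in the text (both definitions are abbreviated there), the dictionary layout of an event is hypothetical, and treating neighbors as a symmetric relation is an assumption.

```python
# Unary class predicate: only the members listed in the text are included.
ARABIC_COUNTRY = {"SYR", "JOR", "IRQ"}

# Binary relation: only the pairs listed in the text are included.
NEIGHBORS = {("IRQ", "IRN"), ("ISR", "JOR"), ("ISR", "SYR")}


def arabic_country(code):
    # True if the attribute value belongs to the predefined class.
    return code in ARABIC_COUNTRY


def equal(a, b):
    # Binary equality test between two attribute values.
    return a == b


def neighbors(a, b):
    # Assumption: the relation is symmetric, so both orders are checked.
    return (a, b) in NEIGHBORS or (b, a) in NEIGHBORS


x = {"Actor1": "ISR", "Actor2": "SYR", "Action": "223"}
checks = (
    arabic_country(x["Actor2"]),
    neighbors(x["Actor1"], x["Actor2"]),
    equal(x["Actor1"], x["Actor2"]),
)
```

For the example event, the target is an Arabic country, the two actors are neighbors, and they are not equal.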

Dependency schemata. To allow for a focused, problem-specific search on the data, a simple language was created that allows the user to specify conditions on the patterns to be searched for. An example of such a schema is:

    Actor1 = USA, Actor2 = ISR
    Actor1 = ISR, Actor2 = USA
    ask-for-aid(Action)
    assistance(Action)
    promise-aid(Action)

where one searches for rules that involve the U.S. and Israel (as either active or passive actor) and which involve some kind of help action (where ask-for-aid etc. are again user-defined unary predicates on actions).


Another example of a search schema, this time including binary predicates, is:

    neighbors(Actor1,Actor2)
    threat(Action)
    protest(Action)
    Action = 223

where one is searching for threats and protests (user-defined abstract predicates) and the specific action 223 between neighbors. 223 is the KEDS event code for a military engagement.

EXPERIMENTAL RESULTS

Database Characteristics

As mentioned above, two databases were used, one that originated from the lead texts of Reuters newswire articles, and one that results from the entire text of the same articles. From the latter, a third set was constructed that consisted of the entries that concerned the second Gulf War (Iraq's occupation of Kuwait and the U.S.'s intervention) and its consequences. For this set, the last 1022 days of the full dataset were used so that all events from the crisis itself were included, as well as all subsequent events up to the end of the database.

Table 1 shows some characteristics of these three datasets. The full set contains 304,402 events. Altogether about 200 different actors appear in the active or passive actor fields, and 66 different actions appear in the events (which means that 44 of the 110 event codes of the WEIS coding, i.e., 40%, do not appear at all). On average, each day contains 36 events. In the subset that is concerned with the Gulf War, each day contains about 45 events, indicating that this is a period of increased activity. The lead dataset contains far fewer events, which reflects the fact that it was not coded from the entire newswire articles.

Table 1. Characteristics of the used datasets.

                            Lead       Full      Gulf
No. Events                57,131    304,402    57,195
No. Unique Daily Events   54,466    265,273    46,267
No. Different Events      17,123     49,550    14,438
No. Days                   7,291      7,291     1,022
Avg. Unique Events/Day      7.47      36.38     45.27
No. Actor 1                  164        199       157
No. Actor 2                  166        196       165
No. Action                    65         66        64

The gulf dataset has about the same number of examples as the lead dataset. However, they differ in some of their other characteristics, in particular in the number of unique daily events (counting identical events only once per day) and the number of different events (counting identical events only once for the entire dataset). The reason for this is that the events from the gulf dataset originate from fewer newswire articles, which are described in more detail and are thus more repetitive (i.e., through referral to recent past events). All subsequent experiments were performed on sets with unique daily events.
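The two counting conventions can be stated precisely in a few lines. The sketch below assumes that events are given as tuples (day, actor1, actor2, action) and that the number of days in the covered period is known; the function name `dataset_stats` and the toy data are hypothetical.

```python
def dataset_stats(events, n_days):
    """Counts in the style of Table 1 for events (day, actor1, actor2, action)."""
    # Identical events are counted only once per day.
    unique_daily = set(events)
    # Identical events are counted only once for the entire dataset.
    different = {(a1, a2, action) for _, a1, a2, action in events}
    return {
        "events": len(events),
        "unique_daily": len(unique_daily),
        "different": len(different),
        "avg_unique_per_day": len(unique_daily) / n_days,
    }


sample = [
    (1, "USA", "IRQ", "023"),
    (1, "USA", "IRQ", "023"),  # repeated on the same day
    (2, "USA", "IRQ", "023"),  # same event on another day
    (2, "IRQ", "USA", "223"),
]
stats = dataset_stats(sample, n_days=2)
```

For the toy data this gives 4 events, 3 unique daily events, and 2 different events. The averages in Table 1 follow the same rule, e.g., 54,466 unique daily events over 7,291 days for the lead dataset give about 7.47 events per day.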

The four most frequent actions in the full database are 023 (neutral comment), 031 (meeting), 223 (military engagement), and 032 (visit), each of which occurs more than 20,000 times, and more than 4,000 times in the lead database. All other types of actions occur significantly less frequently. The most frequent actors are the U.S., Iran, and Iraq, followed by the United Nations and Israel. It is interesting to note that while the U.S. is the most frequent first, active actor by a wide margin, Iraq is the most frequent second, passive actor by an equally wide margin (both occur more than 50,000 times in the full set, followed by the other of the two and Iran with slightly more than 30,000 occurrences). The most frequent episodes are military engagements between Iran and Iraq, Iraq and Iran, U.S. and Iraq, and Israel and Lebanon, followed by neutral comments between these parties and the United Nations' visits to Iraq.

Quantitative Results

In a first experiment, the authors tried to find associations between events. The lead and gulf datasets were analyzed with the same set of parameters: The time window was chosen to incorporate 7 days (W = 7 and V = 7), episodes of length up to 4 were looked for, and a set of 17 unary predicates was used that encoded abstractions of the action fields. For the lead and gulf datasets, min-freq (mF) was set to 20 and min-conf (mC) was set to 0.5. For the data set that contained all examples generated from the full texts, more restrictive values (mC = 0.7, mF = 100, V = W = 5) had to be used. The number of rules found in the other datasets with these parameters is also given to allow a rough extrapolation.

The difference between the lead and the gulf datasets in terms of the number of episodes and sequences found is striking. Although both sets are of similar size, the gulf set seems to be much more repetitive (even though duplicate events that occurred on the same day had already been removed). This, of course, is again explained by the lack of detail in the encoding. Table 2 clearly illustrates the effect of increased repetitiveness on the complexity of the task.

Figure 4 shows the number of rules found (on a logarithmic scale) for various settings of the minimum confidence and frequency parameters for the gulf dataset. Apparently, the number of rules found grows doubly exponentially with decreasing minimum frequency. Lowering minimum confidence, on the other hand, results in a subexponential growth of the number of rules found. A look at the original (nonlogarithmic) scale (not shown) shows that the growth is still superlinear. Note that the growth along the minimum frequency axis (for constant minimum confidence) roughly corresponds to the growth in the number of episodes found and thus to the time complexity of the task. The same experiment was repeated for the lead dataset, where the shape of the mesh looked almost identical, although the absolute numbers of rules found were lower by two orders of magnitude (cf. Table 2).

Table 2. Number of found episodes and rules (events with 17 abstract action predicates)

                                      Lead      Gulf      Full
mF = 20, mC = 0.5, V = 7, W = 7
    No. Episodes                     8,073   167,993      n.a.
    No. Rules                        2,307   104,449      n.a.
mF = 100, mC = 0.7, V = 5, W = 5
    No. Episodes                        94       153    12,146
    No. Rules                            0        26     1,746



Figure 4. Number of rules found in the gulf set for various settings of the minimum confidence and minimum frequency parameters (logarithmic z axis).

Anecdotal Results

In this section, some of the rules that resulted from our experiments are shown. In general, finding interesting rules in the large amounts of rules that are generated is a data mining task in itself. The rules with the highest confidence and support are typically not very interesting. For example, the rule with the highest confidence and frequency in the gulf set is the rule:

    USA IRQ demo
    IRQ USA 223
    USA IRQ 223
    ⇒ USA IRQ 223
    (C: 1.0, F: 45)

Demo is an abstract predicate that comprises military and nonmilitary demonstrations and 223 is the event code for military engagement. Subsequent rules are of a similar nature. For example, the following rule

    USA IRQ comment
    RUS IRQ comment
    UNO IRQ comment
    ⇒ UNO IRQ comment
    (C: 1.0, F: 38)

seems to represent a typical communicative sequence. Repeated conditions, as in the previous rules, happen quite frequently. For example, the three most frequent rules are all based on a four-fold repetition of the event USA IRQ comment:

    USA IRQ comment
    USA IRQ comment
    USA IRQ comment
    ⇒ USA IRQ comment
    (C: 0.66, F: 381)

    USA IRQ comment
    USA IRQ comment
    ⇒ USA IRQ comment
    USA IRQ comment
    (C: 0.75, F: 337)

    USA IRQ comment
    ⇒ USA IRQ comment
    USA IRQ comment
    USA IRQ comment
    (C: 0.87, F: 289).

The last case, for example, means that whenever the U.S. has commented three times within 7 days on Iraq, there is an 87% chance that a fourth comment happened later in the same time window.

Clearly, such rules are of limited interest. In order to aid one in the task of detecting more interesting rules, a variety of post-processing filters were written that allowed the authors to efficiently sift through the output of the rule finder, to sort the rules according to various criteria, and to extract those that match specified patterns on the right-hand side or left-hand side of the rule. For example, filtering out all rules in which the same condition occurs more than once reduces the number of found rules for the parameter setting in the upper half of Table 2 from 104,449 to 38,393. Among these, the most frequent is

    UNO IRQ meet
    IRQ UNO meet
    IRQ UNO protest
    ⇒ UNO IRQ comment
    (C: 0.63, F: 79).

Focusing on more interesting subsets has also been attempted. For example, filtering for rules that predict a threat gives the following most frequent rule:

    USA IRQ comment
    IRQ UNO protest
    UNO IRQ comment
    ⇒ USA IRQ threat
    (C: 0.56, F: 70).

A rule with confidence 1 for predicting a military action (223) of the U.S. against Iraq is, for example:

    IRQ UNO diplomatic restriction
    UNK IRQ 223
    ⇒ USA IRQ 223
    (C: 1.0, F: 20).
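The kind of post-processing filter described above can be sketched as follows. The representation of a rule as a named tuple, the helper names `has_repeated_condition` and `predicts`, and the splitting of each rule into antecedent and consequent are assumptions made for this illustration; the three example rules are taken from the text.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    antecedent: tuple   # conditions (actor1, actor2, action) before the arrow
    consequent: tuple   # conditions after the arrow
    confidence: float
    frequency: int


def has_repeated_condition(rule):
    # True if the same condition occurs more than once anywhere in the rule.
    conditions = rule.antecedent + rule.consequent
    return len(set(conditions)) < len(conditions)


def predicts(rule, action):
    # True if some condition on the right-hand side has the given action.
    return any(cond[2] == action for cond in rule.consequent)


comment = ("USA", "IRQ", "comment")
rules = [
    Rule((comment, comment, comment), (comment,), 0.66, 381),
    Rule(
        (("UNO", "IRQ", "meet"), ("IRQ", "UNO", "meet"), ("IRQ", "UNO", "protest")),
        (("UNO", "IRQ", "comment"),),
        0.63,
        79,
    ),
    Rule(
        (comment, ("IRQ", "UNO", "protest"), ("UNO", "IRQ", "comment")),
        (("USA", "IRQ", "threat"),),
        0.56,
        70,
    ),
]

# Drop rules with repeated conditions and sort the rest by frequency.
distinct = sorted(
    (r for r in rules if not has_repeated_condition(r)),
    key=lambda r: r.frequency,
    reverse=True,
)
# Keep only rules that predict a threat.
threat_rules = [r for r in rules if predicts(r, "threat")]
```

On these three rules, the first filter removes the four-fold comment rule and leaves the rule with frequency 79 at the top, while the second filter selects the rule with frequency 70.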

An analysis of particular questions has also been tried. Kovar (1999) shows several rules detected in the first Gulf War (IRN/IRQ), the second Gulf War (USA/IRQ), the relations between the U.S. and Israel, Israel and Palestinian organizations, Israel and Lebanon, and Israel and its Arab neighbors. Each question was investigated with a variety of parameter settings. The procedure was as follows: First, rules between a limited set of participants were looked for. This can be done with a relatively low minimum frequency. Then, the minimum frequency was gradually increased and additional unary predicates were added to the background knowledge (adding unary predicates can be seen as a step for generalizing the found rules; hence, the frequency of the rules can be expected to go up). The window size parameters W and V were also varied. The results ranged from simple action sequences (a military engagement following another military engagement within 2 days with a confidence of 95.26%, and within 20 days with a confidence of 99.99%), to more specific rules that seem to constitute characteristic (though deplorable) action patterns, as in

    LEB ISR violation
    LEB ISR violation
    ⇒ ISR LEB 223
    (C: 0.61, F: 60, V: 10, W: 20)

or

    USA ISR protest
    ISR LEB 223
    ⇒ ISR LEB 223
    (C: 0.77, F: 60, V: 10, W: 20).

Efficiency and Preprocessing Issues

Although the Apriori algorithm and its derivatives are typically very fast, the size of the current database allowed the study of only subsets of the dataset. In order to answer interesting questions like

"What events imply a military engagement between A and B within the following n days?"

basically all possible episodes for the time window of interest have to be generated, and then this set has to be filtered for occurrences that involve "A B 223" on the right-hand side. Likewise, for analyzing the consequences of certain events, the results (of a complete search) have to be filtered for occurrences of the events in question in the left-hand side of a rule.

What would be desirable is a filter that can effectively constrain the search for such types of problems. This, however, is not possible because such a constraint should only apply to one event of an episode and leave all degrees of freedom to the rest. The authors have experimented with straightforward preprocessing techniques that reduce the number of examples for such types of problems.

For example, if one is interested in events that predict a military engagement between Israel and Lebanon within 1 week, one marks all occurrences of "ISR LEB 223" and removes all events that occurred more than 7 days before such an event. In this example, this procedure reduced the size of the dataset from 304,402 to 160,071, i.e., to roughly one-half the size of the original set. While the rule finder will only detect a fraction of the frequent episodes of the original dataset (and thus be more efficient), all of the rules that are relevant to the subsequent filter query should still be present.
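This preprocessing step can be sketched as a simple filter over the event list. The sketch assumes events as tuples (day, actor1, actor2, action) with `datetime.date` values, and it keeps an event only if some target event occurs on the same day or within the following `days` days; the function name `keep_events_before_target` and the toy data are hypothetical.

```python
from bisect import bisect_left
from datetime import date, timedelta


def keep_events_before_target(events, is_target, days):
    """Keep only events that occur at most `days` days before a target event.

    Target events themselves are kept (distance 0).
    """
    target_days = sorted({e[0] for e in events if is_target(e)})
    kept = []
    for e in events:
        # First target day that is not earlier than the event's day.
        i = bisect_left(target_days, e[0])
        if i < len(target_days) and target_days[i] - e[0] <= timedelta(days=days):
            kept.append(e)
    return kept


def is_isr_leb_223(e):
    return e[1:] == ("ISR", "LEB", "223")


base = date(1990, 1, 1)
events = [
    (base, "LEB", "ISR", "223"),
    (base + timedelta(days=4), "ISR", "LEB", "223"),
    (base + timedelta(days=19), "USA", "ISR", "031"),
    (base + timedelta(days=29), "ISR", "LEB", "223"),
    (base + timedelta(days=30), "USA", "IRQ", "023"),
]
kept = keep_events_before_target(events, is_isr_leb_223, days=7)
```

In the toy data, the event on day 19 is dropped because the next target event is 10 days away, and the last event is dropped because no target event follows it.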

Binary Predicates

The biggest failure of the project so far was the use of binary predicates. An attempt was made to find simple rules that involve an equality or a neighbor predicate, but the process is too inefficient to be tractable in reasonable times for any of the three datasets that were looked at. A fundamental problem was already noted by Mannila and Toivonen (1996). In the terminology of their paper, Lemma 7 does not hold for general episodes. The problem is that episodes that involve binary predicates cannot be detected solely with minimal episodes because an occurrence of the episode equal(x.Actor1,y.Actor1) may be nested within another occurrence (see Mannila and Toivonen (1996) for a more detailed example). In fact, we believe that the problem is more prevalent: In the simple event sequence A,B,B,C (A, B, and C represent events) the minimal occurrences of the episodes A,B and B,C do not overlap. Hence, the episode A,B,C cannot be detected using a simple join of the episodes A,B and B,C. This problem was simply ignored in both cases, in the hope that this sacrifice of completeness would buy the necessary gain in efficiency.

However, the main reason for the failure to get usable results involving binary predicates seemed to be that computation of the matches for a binary predicate is computationally too expensive for the task at hand. Imagine, for example, that frequent episodes are looked for in a time window of 7 days. In the simplest case, the lead dataset, one has about 7.5 events per day, i.e., the window encompasses more than 50 events. All possible combinations of these events have to be examined to determine that a binary predicate does not occur in this window.[1] Hence, the effort is quadratic in the number of events in the window. Note that it is not possible to use information obtained from the unary predicates to restrict the search space. The predicate equal(x.Action, y.Action) may be frequent, although none of its instantiations (concrete values for the action) may be frequent.
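The quadratic effort can be made tangible with a small calculation based on the averages in Table 1. The sketch counts the unordered pairs of events in one 7-day window, assuming that every day contains the average number of unique events; the rounding and the variable names are choices made for this illustration.

```python
from math import comb

# Average unique events per day, taken from Table 1.
avg_events_per_day = {"lead": 7.47, "full": 36.38, "gulf": 45.27}
window_days = 7

pairs_per_window = {}
for name, per_day in avg_events_per_day.items():
    n = round(per_day * window_days)      # events in one window
    pairs_per_window[name] = comb(n, 2)   # pairs a binary predicate may relate
```

Under these assumptions, a single window of the lead dataset already contains 52 events and 1,326 event pairs, and the corresponding counts for the full and gulf datasets are 32,385 and 50,086 pairs per window.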

The conclusion from this is that the use of binary predicates isintractable for applications with larger time windows such as this one.In future work, more sophisticated, possibly heuristic search techniqueswill have to be developed to cope with this problem, or alternatively,more intelligent preprocessing techniques have to be applied that allowoneöin cooperation with a domain expertöto reduce the number ofevents per day by dynamically ¢ltering out events that are irrelevantto the particular questions.

RELATED WORK

Oates et al. (1998) brie£y describe some anecdotal results of applyingtheir multievent dependency detection (MEDD) algorithm to the KEDSdata. However, they only used a tiny subset of 886 events, and theyapparently did not go on to investigate the KEDS database in any moredetail. Our experiments seem to be the ¢rst large-scale attempt atapplying symbolic data mining algorithms to the KEDS database inits entirety.

Schrodt (1997) uses hidden Markov models to detect early warning indicators (3 to 6 months lead) for Tit-for-Tat (TFT) conflicts between Israel and Lebanon, i.e., the appearance of "ISR LEB 22x" followed by a counter-action "LEB ISR 22x" within 2 days. LEB typically stands for PLO (before 1982) or for Amal or Hizbollah militia (after 1985).
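The TFT definition quoted above can be written down as a simple scan over an event list. The sketch below assumes that each event is a tuple (date, source, target, code), that the list is sorted by date, and that "22x" means any code beginning with "22". The record layout, the function name, and the sample dates are hypothetical. It only identifies TFT events and has nothing to do with the hidden Markov models used by Schrodt.

```python
from datetime import date, timedelta


def find_tft(events, max_lag_days=2):
    """Return (action_date, counter_action_date) pairs in which an
    ISR -> LEB 22x event is answered by a LEB -> ISR 22x event
    no more than max_lag_days later."""
    hits = []
    for i, (d1, src1, tgt1, code1) in enumerate(events):
        if (src1, tgt1) != ("ISR", "LEB") or not code1.startswith("22"):
            continue
        for d2, src2, tgt2, code2 in events[i + 1:]:
            if d2 - d1 > timedelta(days=max_lag_days):
                break  # events are sorted, so later ones are too late as well
            if (src2, tgt2) == ("LEB", "ISR") and code2.startswith("22"):
                hits.append((d1, d2))
                break
    return hits


# Invented sample events, sorted by date.
toy_events = [
    (date(1990, 1, 1), "ISR", "LEB", "223"),
    (date(1990, 1, 2), "LEB", "ISR", "222"),
    (date(1990, 1, 10), "ISR", "LEB", "221"),
    (date(1990, 1, 15), "LEB", "ISR", "223"),
]
tft = find_tft(toy_events)
```

In the sample, only the first exchange counts as TFT, because the second counter-action comes five days after the action.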

1 Determining that it occurs may be a little faster, in particular if one is only interested in computing minimal occurrences.

Our methods have the potential of producing results that could be interpreted as early warning indicators. For example, with the application of the preprocessing technique briefly described in the previous section, several patterns were discovered that were indicative of a military action of Israel against Lebanon. The more interesting rules in this case look like this:

LEB ISR 223
ISR LEB threat
=> ISR LEB 223
(C: 0.79, F: 77).

Violent actions of Lebanese groups against Israel that are followed by a threat from Israel typically (79% of the time) result in an Israeli counteraction within 7 days (the window size that was chosen for this analysis). However, once more, the main deficiency is the lack of efficient means for focusing the search on the interesting parts, which makes big searches that would incorporate windows of, say, 30 days or more infeasible for the entire database.
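How a confidence value for such a rule might be computed can be sketched as follows. The sketch assumes events given as (day, label) pairs sorted by day, counts one candidate occurrence of the antecedent per starting event, and takes confidence to be the fraction of those occurrences that are followed by the consequent within the window. This is a simplification and not the minimal-occurrence counting of Mannila and Toivonen (1996). The function names and the sample data are hypothetical, and the exact meaning of F in the rule above is not assumed here.

```python
def match_serial(events, start, labels, deadline):
    """Greedily match the labels in order, scanning from index start.
    Return the index of the last matched event, or None if a label cannot
    be matched on or before the deadline day."""
    pos = start
    last = None
    for label in labels:
        while (pos < len(events) and events[pos][0] <= deadline
               and events[pos][1] != label):
            pos += 1
        if pos >= len(events) or events[pos][0] > deadline:
            return None
        last = pos
        pos += 1
    return last


def rule_stats(events, antecedent, consequent, window_days):
    """Return (confidence, rule_count, antecedent_count) for the rule
    antecedent => consequent inside a window of window_days days."""
    n_antecedent = n_rule = 0
    for i, (day, label) in enumerate(events):
        if label != antecedent[0]:
            continue
        deadline = day + window_days
        end = match_serial(events, i, antecedent, deadline)
        if end is None:
            continue
        n_antecedent += 1
        if match_serial(events, end + 1, [consequent], deadline) is not None:
            n_rule += 1
    confidence = n_rule / n_antecedent if n_antecedent else 0.0
    return confidence, n_rule, n_antecedent


# Invented sample sequence (day number, event label).
toy_events = [
    (1, "LEB ISR 223"), (2, "ISR LEB threat"), (4, "ISR LEB 223"),
    (20, "LEB ISR 223"), (21, "ISR LEB threat"),
    (40, "LEB ISR 223"), (43, "ISR LEB threat"), (46, "ISR LEB 223"),
    (60, "LEB ISR 223"), (70, "ISR LEB threat"),
]
conf, n_rule, n_antecedent = rule_stats(
    toy_events, ["LEB ISR 223", "ISR LEB threat"], "ISR LEB 223", window_days=7
)
```

In the sample, the antecedent occurs three times within 7 days and is followed by the consequent twice, so the sketch reports a confidence of 2/3. The fourth violent action is not counted because the threat follows more than 7 days later.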

DISCUSSION

This work is viewed as a first study that delivers first results but mainly focuses on identifying the problems that have to be tackled in future work.

The main weakness of this study is a certain lack of focus with respect to the political results. This was caused by two major factors: the abundance of found rules (the vast majority of which were not interesting) and the lack of input from a domain expert.

From a technical point of view, the main result is the impracticality of the straightforward approach for incorporating binary predicates. Mannila and Toivonen (1996) predicted this problem, but even the simplistic approach of sacrificing completeness for a gain in efficiency did not yield satisfying results. This is seen as a main topic for further research.

From a political science point of view, it should be stressed that only a fraction of the rules that look interesting can be shown. Kovar (1999) describes many more experiments, and there are innumerable other questions that have not been investigated. Also, the authors are laymen in political science. These results should be discussed with political scientists, or, better, the program should be extended so that an unskilled user can interactively query the database for interesting patterns, preferably with a visual query interface. Currently, an attempt is being made to contact political scientists for help in focusing on important research questions on this and related problems.

In any case, these results have demonstrated that a fully automated approach to detecting interesting episodes or rules, for example, with the goal of early warning, has its limitations on the databases analyzed. At least in this domain, high frequency and high confidence are not necessarily correlated with interestingness. Additional domain knowledge is needed for focusing the search, and additional algorithmic research is needed for doing this efficiently.

REFERENCES

Agrawal, R., and R. Srikant. 1994. Fast algorithms for mining association rules in large databases. In Proceedings of the Twentieth International Conference on Very Large Data Bases (VLDB ’94), 487–499.

Agrawal, R., and R. Srikant. 1995. Mining sequential patterns. In Proceedings of the Eleventh International Conference on Data Engineering (ICDE ’95), 3–14, Taipei, Taiwan.

Bercovitch, J., and J. Langley. 1993. The nature of dispute and the effectiveness of international mediation. Journal of Conflict Resolution 37(4):679–691.

Fürnkranz, J., J. Petrak, and R. Trappl. 1997. Knowledge discovery in international conflict databases. Applied Artificial Intelligence 11(2):91–118.

Gerner, D. J., and P. A. Schrodt. 1996. The Kansas Event Data System: A beginner’s guide with an application to the study of media fatigue in the Palestinian Intifada. http://ravens.ukans.cc.edu/~keds

Kovar, K. September 1999. Data mining in Ereignisfolgen: Experimente mit der KEDS-Datenbank. Master’s Thesis, Department of Medical Cybernetics and Artificial Intelligence, University of Vienna (in German).

Mannila, H., H. Toivonen, and A. I. Verkamo. 1995. Discovery of frequent episodes in event sequences. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD’95), 210–215.

Mannila, H., and H. Toivonen. 1996. Discovering generalized episodes using minimal occurrences. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD’96).

Oates, T., and P. R. Cohen. 1996. Searching for structure in multiple streams of data. In Proceedings of the Thirteenth International Conference on Machine Learning, 346–354.

Oates, T., D. Jensen, and P. R. Cohen. 1998. Discovering rules for clustering and predicting asynchronous events. In Predicting the Future: AI Approaches to Time Series Workshop, AAAI-98, 73–79.



Pfetsch, F. R., and P. Billing. 1994. Datenhandbuch nationaler und internationaler Konflikte. Nomos Verlagsgesellschaft, Baden-Baden.

Quinlan, J. R. 1990. Learning logical definitions from relations. Machine Learning 5:239–266.

Schrodt, P. A. August 1997. Early warning of conflict in Southern Lebanon using hidden Markov models. Paper presented at the Annual Meeting of the American Political Science Association, Washington D.C.

Schrodt, P. A., and D. J. Gerner. March 1998. An event data set for the Arabian/Persian Gulf Region 1979–1997. Paper presented at the 1998 Annual Meeting of the International Studies Association, Minneapolis, MN.

Sherman, F. L. 1988. Sherfacs: A new cross-paradigm, international conflict dataset. Paper presented at the 1988 Annual Meeting of the International Studies Association.

Srikant, R., and R. Agrawal. 1995. Mining generalized association rules. In Proceedings of the 21st VLDB Conference, Zurich, Switzerland.

Srikant, R., and R. Agrawal. 1996. Mining sequential patterns: Generalizations and performance improvements. In International Conference on Extending Database Technology (EDBT 1996).

Trappl, R., J. Fürnkranz, and J. Petrak. 1996. Digging for peace: Using machine learning methods for assessing international conflict databases. In Proceedings of the 12th European Conference on Artificial Intelligence (ECAI-96), Budapest.
