
Title Page Dummy

This Page will be Replaced before Printing



Abstract

Formal models are often used to describe the behavior of a computer program or component. Behavioral models have many different usages, e.g., in model-based techniques for software development and verification, such as model checking and model-based testing.

The application of model-based techniques is hampered by the current lack of adequate models for most actual systems, largely due to the significant manual effort typically needed to construct them. To remedy this, automata learning techniques (whereby models can be inferred by performing tests on a component) have been developed for finite automata that capture control flow. However, many usages require models also to capture data flow, i.e., how behavior is affected by data values in method invocations and commands. Unfortunately, techniques are less developed for models that combine control flow with data.

In this thesis, we extend automata learning to infer automata models that capture both control flow and data flow. We base our technique on a corresponding extension of classical automata theory to capture data.

We define a formalism for register automata, a model that extends finite automata by adding registers that can store data values and be used in guards and assignments on transitions. Our formalism is parameterized on a theory, i.e., a set of relations on a data domain. We present a Nerode congruence for the languages that our register automata can recognize, and provide a Myhill-Nerode theorem for constructing canonical register automata, thereby extending classical automata theory to register automata.

We also present a learning algorithm for register automata: the new SL* algorithm, which extends the well-known L* algorithm for finite automata. The SL* algorithm is based on our new Nerode congruence, and uses a novel technique to infer symbolic data constraints on parameters. We evaluated our algorithm in a black-box scenario, inferring, e.g., the connection establishment phase of TCP and a priority queue, in addition to a number of smaller models. The SL* algorithm is implemented in a tool, which allows for use in more realistic settings, e.g., where models have both input and output, and for directly inferring Java classes.


Nothing endures.


List of papers

This thesis is based on the following papers, which are referred to in the text by their Roman numerals.

I S. Cassel, F. Howar, B. Jonsson, M. Merten, B. Steffen: A succinct canonical register automaton model. In Journal of Logical and Algebraic Methods in Programming, Volume 84, Issue 1, January 2015.

REVISED VERSION OF: A Succinct Canonical Register Automaton Model. In Proceedings of Automated Technology for Verification and Analysis, 9th International Symposium, ATVA 2011, Taipei, Taiwan, October 11-14, 2011.

II F. Howar, B. Steffen, B. Jonsson, S. Cassel: Inferring Canonical Register Automata. In Verification, Model Checking, and Abstract Interpretation - 13th International Conference, VMCAI 2012, Philadelphia, PA, USA, January 22-24, 2012.

III S. Cassel, F. Howar, B. Jonsson, B. Steffen: Active Learning for Extended Finite State Machines. Technical report 2015-032, Dept. of IT, Uppsala University, October 2015.

REVISED VERSION OF: Learning Extended Finite State Machines. In Software Engineering and Formal Methods - 12th International Conference, SEFM 2014, Grenoble, France, September 1-5, 2014.

IV S. Cassel, F. Howar, B. Jonsson: RALib: A LearnLib extension for inferring EFSMs. In International Workshop on Design and Implementation of Formal Tools and Systems, DIFTS 2015, Austin, TX, USA, September 26-27, 2015.

Reprints were made with permission from the publishers.


Comments on my participation

Paper I: I participated in the discussions, in working out the proofs, and in writing the paper.

Paper II: I participated in the discussions and writing of the paper.

Paper III: This paper originated from a discussion with a co-author about extending the approach in Paper II. I participated in writing the paper and integrating the theory with that in [30].

Paper IV: I participated in the discussions and writing of the paper. The implementation was done in collaboration with a co-author.


Other publications

F. Howar, B. Jonsson, M. Merten, S. Cassel: On Handling Data in Automata Learning - Considerations from the CONNECT Perspective. In Leveraging Applications of Formal Methods, Verification, and Validation - 4th International Symposium on Leveraging Applications, ISoLA 2010, Heraklion, Crete, Greece, October 18-21, 2010.

M. Merten, F. Howar, B. Steffen, S. Cassel, B. Jonsson: Demonstrating Learning of Register Automata. In Tools and Algorithms for the Construction and Analysis of Systems - 18th International Conference, TACAS 2012, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2012, Tallinn, Estonia, March 24 - April 1, 2012.

S. Cassel, B. Jonsson, F. Howar, B. Steffen: A Succinct Canonical Register Automaton Model for Data Domains with Binary Relations. In Automated Technology for Verification and Analysis - 10th International Symposium, ATVA 2012, Thiruvananthapuram, Kerala, India, October 3-6, 2012.


Summary in Swedish


Contents

1 Introduction
  1.1 Behavioral models
  1.2 Dynamic model generation (automata learning)
  1.3 Research challenges
  1.4 In this thesis

2 Finite state machines
  2.1 Finite automata and regular languages
  2.2 Constructing a DFA from a regular language

3 Extended finite state machines
  3.1 Data languages and register automata
  3.2 Constructing a register automaton from a data language

4 Algorithms for automata learning
  4.1 The L∗ algorithm: inferring finite automata models
  4.2 The SL∗ algorithm: inferring register automata models
  4.3 Automata learning in realistic scenarios

5 Summaries of papers
  I A succinct canonical register automaton model
  II Inferring Canonical Register Automata
  III Active Learning for Extended Finite State Machines
  IV RALib: A LearnLib extension for inferring EFSMs

6 Related work
  6.1 Representing languages over infinite alphabets
  6.2 Learning extended finite state machines with L∗

7 Conclusion(s)
  7.1 Future work

References


1. Introduction

Suppose we are presented with a large, red box. The box has several buttons, and something that looks like a display. What is it, and what can we use it for? At this point, we cannot know. Perhaps it is a bomb, in which case it might blow up at some point; perhaps it is a coffee machine, in which case it might produce a cup of espresso. We could start looking for an operating manual that would tell us what we can do with the box and what kind of behavior we should expect in response. But if there is no manual, perhaps there is a way to somehow generate a description of the box and its behavior?

One way to figure out how the box behaves is to test it (in a controlled environment, obviously, in case it is a bomb). First, however, we need to find out how to interact with it. In this case, a natural way to interact with the box is to use the buttons that we already noticed. Then, we are ready to test it by pushing different buttons and observing what happens. In order to be able to write a proper description, we also need to carefully write down the tests we perform and what the box does in response. If we choose the tests in a systematic way, we can produce a description of how the box behaves in almost any situation.

This thesis is about generating descriptions of components (such as the red box) or computer programs. Such descriptions describe how computer programs (or red boxes) behave in response to certain inputs by you, the user. By examining such a description, we can sometimes discover that a program does something strange or unexpected. The descriptions can also be used for other purposes, for example, to facilitate the process of developing large, complex computer systems.

1.1 Behavioral models

A description of how a computer program or component behaves is often called a model. Models can illustrate different aspects or properties of programs or components, such as their functionality or behavior. At the same time, they abstract from less relevant aspects, omitting details accordingly. For example, a model of a coffee maker can show that the machine brews a cup of cappuccino whenever the CAPPUCCINO button is pushed, a cup of macchiato whenever the MACCHIATO button is pushed, and otherwise stays idle. It can also show that the order of certain inputs to a component matters: perhaps pushing a + button immediately before the CAPPUCCINO button will give you an extra strong cappuccino, but pushing + immediately after the CAPPUCCINO button will have no additional effect. However, the model will not include all details about the inputs, e.g., for how long a button was pushed.

Using models

Models of computer programs can be used for many different purposes. A very promising approach is model-driven engineering [66], which can be particularly useful when developing large or complex computer systems that consist of many different parts.

In model-driven engineering, models are used in each stage of the development process: to specify the intended behavior of a program, to describe the functionality of a program that has already been implemented, and to check whether the implementation contains unwanted errors or exhibits unexpected behavior. This has several benefits. First, using models to specify intended behavior forces engineers to consider the system they are developing at an abstract level, before they actually write the program code. This will often lead to better design decisions. Second, models that specify the intended behavior of some part of a system can sometimes be used to automatically generate program code. This means less manual effort is required, and also that the generated code can be known to be correct ’by construction’, provided that the code generation process does not introduce any errors. Third, models can make it easier to check whether the developed system meets customer requirements.

Models are also used in model-based testing [70, 27], where models are used as a basis for generating test cases, and in model checking [33, 13]. In model checking, a model of a program is automatically checked to see whether it satisfies certain desirable properties. In some sense, these properties help define the intended behavior of the program. There are many model checking tools available, both for programs written in particular languages [14, 26] (such as C++ or Java) and for particular types of programs [10, 11, 43], as well as for constructing particular kinds of models, e.g., models that take into account real-time parameters or the probability of a system behaving in a certain way [52, 53, 54, 55].

Generating models

A problem with models of programs or components is that they are not always available. Creating models by hand is both difficult and time-consuming, especially for complex computer programs or entire systems where many details need to be taken into account. This means that models are often not created at all, or only for particular parts of a larger system. Another problem is that models may be outdated or inaccurate, if the software has evolved since the model was created. To address these problems, different techniques for automatically generating models with as little human effort as possible have been developed.

Broadly speaking, there are two main approaches to the problem of automatically generating models of already implemented programs. In static analysis, a model is created by analyzing source code. A prerequisite for static analysis is thus that the source code is available. Static analysis can then exploit the fact that the code provides complete information about the program or component's behavior. This means that a model can be created without the program or component ever having to be tested or executed in practice. Static analysis is used in many different settings, for example, to find security flaws in programs before they are deployed at a larger scale [56], or to analyze medical devices to ensure that they will work as intended before they are used by real humans [48].

In dynamic analysis, a set of tests is performed on the computer program or component. The results are collected and compiled into a model that describes the program or component, for example how it behaves in response to certain input. The advantage of this approach is that there is no need to access the source code, since we are only observing the behavior from outside. Like static analysis, dynamic analysis is also used in many different settings. There are tools available that, for example, can find invariants (i.e., statements that should always hold true) in programs, and discover security vulnerabilities in web applications. Dynamic analysis is often divided into white-box and black-box techniques. In white-box techniques, the source code is accessible, so methods like symbolic execution can be used. In black-box techniques, there is no access to source code, and the only available part of the program or component is its external interface, i.e., what methods are available or what messages can be sent or received by the component or program.

1.2 Dynamic model generation (automata learning)

Finite automata are often used to represent the control flow of a computer program or component, i.e., the order in which the program executes instructions, processes different input from its surroundings, etc. A finite automaton can be drawn as a graph: a finite set of nodes connected by directed edges. The nodes are typically called states, and the edges are called transitions. Transitions are labeled with symbols that represent instructions or inputs to the program. At any time, the automaton is in exactly one of the states, and by following a transition arrow, it can move to another state. A deterministic finite automaton is one where, in each state, there is only one possible transition per input.

We can also describe the control flow of a program or component in terms of a language. A language is a set of words, i.e., sequences of symbols. In this setting, symbols are inputs to the program, methods invoked on it, or commands sent to it. We can then define, e.g., the language of all words that do not cause the program to return an error message or go into an error state, or the language of all words that let a user log in to some service provided by a component. Languages can be regular, which entails that they can be recognized by a (deterministic) finite automaton. If we describe the control flow of a program or component as a language, then the finite automaton that recognizes that language can be used as a model of that program or component.

An important property of finite automata models is canonicity. A particular language can be represented by many different finite automata. However, for each regular language there is exactly one minimal (in the number of states) deterministic finite automaton that can recognize it. This means that minimal deterministic finite automata are canonical. Canonical deterministic finite automata can be constructed using the Nerode congruence. This congruence partitions words into a finite set of equivalence classes. The equivalence classes can be used to build a minimal finite automaton where each state represents an equivalence class. Canonical automata are much easier to compare to each other, since we already know that if they differ somehow, they must represent different languages. Canonical models are important, e.g., for equivalence or refinement checking of models, but also in some forms of dynamic analysis.

Automata learning is a particular form of dynamic analysis that learns regular languages and represents them as finite automata. If we can represent the behavior of a component as a language, then we can use automata learning to infer a model of that language. First, a set of tests is performed on the program or component to be modeled, and the responses are collected. Then, a deterministic finite automaton is generated that represents the externally observable behavior of the program or component (i.e., its language).

For example, suppose we have a digital camera with two buttons, labeled on/off and camera, respectively. We want to find out how to get it to take a photo. The camera can be tested by pressing the buttons. For example, we could test the following sequences and observe what happens:

• camera, camera : nothing happens
• camera, on/off : nothing happens
• on/off, camera : the camera takes a photo
• on/off, on/off : the camera turns on, then off
• on/off, camera, on/off : the camera takes a photo, then turns off
• on/off, camera, camera : the camera takes two photos in sequence

By testing the camera in this manner, we can eventually construct a model of the control flow of the camera. Figure 1.1 shows the result. The camera can be in one of three states: it is either OFF, where photos cannot be taken, in STANDBY, where it is on but waiting for the camera button to be pressed, or in a state where it takes photos repeatedly whenever the camera button is pressed. The on/off button always returns the camera to the OFF state.


Figure 1.1. Small model of the control flow of a digital camera with two functions: it can be turned on/off and photos can be taken using the camera button.

Two main flavors of automata learning are usually distinguished: passive automata learning and active automata learning. In passive automata learning, tests on the computer program or system we want to model have already been performed, and are provided in the form of traces, e.g., from system logs. One problem with passive automata learning is that it is only as accurate as the set of traces will allow: if there is a particular behavior that the traces do not show, then this behavior will not appear in the constructed model. In active automata learning, however, we choose which tests to perform on the computer program or system while we are constructing the model. The choice of which test to perform can thus depend on the results of previous tests. This allows us to create more accurate models, and also to refine existing models by performing additional tests when necessary. It also means we can use fewer tests, since we can choose to only perform tests that will generate useful results.

Automata learning has been used in many different settings (see [36] for an overview), e.g., to generate models of communication protocols [18, 34] or Java programs [58, 72], and in security testing [32, 31, 68]. One common aspect of these studies is that the target systems all expose some interface to the outside world, which means we have an idea of how to interact with them.

A common algorithm for automata learning is the L∗ algorithm [12], which infers canonical finite automata based on the Nerode congruence. This algorithm has been implemented in several tools for automata learning, with the two main ones being LearnLib [47, 59, 63] and Tomte [1, 3].

1.3 Research challenges

If we are modeling the control flow of a computer program, we can let input symbols represent, e.g., methods that can be invoked on the program. In this way, the finite-state model represents constraints on the order in which different methods or commands can be executed on the program. However, sometimes we want to take into account not only a program's control flow but also its data flow, i.e., relations between data parameters that are used in methods or commands. To do this, we need a more expressive model than finite automata. One particularly interesting formalism is register automata, which have a finite set of states but are extended with a set of registers that can be used to store data values. Transitions in a register automaton have guards that compare data parameters in input symbols to values stored in registers. A register automaton accepts or rejects a sequence of symbols, i.e., a word, based on relations between data values. Depending on which operations and tests we can use in guards and assignments, register automata can recognize different classes of data languages. In this thesis, we capture this dependency by letting register automata be parameterized on a theory, i.e., a set of relations on some data domain. Many register automata models only allow the theory of equality on some domain, which imposes restrictions on what programs or systems they can model. Adding more relations would allow us to model more complex computer programs. For example, a register automaton that compares data values for inequality (<) could be used to model a program where input values always have to be in a particular order, such as a priority queue data structure. We want to adapt register automata to recognize a wider class of languages.

Recall that canonicity is an important property for finite automata, e.g., because it allows for easier comparisons between models, and because the L∗ algorithm infers canonical finite automata. The latter is particularly interesting in our context. The concept of canonicity has, however, proven hard to carry over to the setting of automata over infinite alphabets (such as, for example, timed automata). This is also true for register automata: where formalisms for canonical register automaton models exist, they tend to impose restrictions on, e.g., the order in which data symbols can be processed, or on the way that data values can be stored in registers. We want to devise a canonical register automaton formalism that avoids these restrictions. To do this, we want to adapt the Nerode congruence for finite automata so that it takes into account relations between data values. In addition to modeling formalisms, we are also interested in a method for constructing the above models using automata learning, particularly the L∗ algorithm. We want to extend and generalize it to make it usable for inferring more complex models. The natural choice is to use register automata, for which we have already defined a canonical formalism that allows comparing data values using binary relations.

Extending the L∗ algorithm to the register automaton setting comes with certain challenges. For example, the algorithm needs to be adapted to use the Nerode congruence for register automata. In order to do this, we need a new way to organize tests and test results, so that we can accurately capture relations between data parameters. A new way to handle counterexamples also needs to be devised, since the one in L∗ does not take into account relations between data values.


Another challenge is to make the SL∗ algorithm more useful when it comes to inferring models of more realistic programs or components. Typically, programs or components produce output that is non-binary (i.e., not just valid/invalid or OK/not OK). Register automata, however, only accept or reject words, so they cannot be used directly to model such programs or components. We want to adapt our learning algorithm for register automata to the input/output setting without having to invent a new algorithm from scratch. One additional problem is that computer programs that produce output sometimes produce ’fresh’ (i.e., previously unseen) output values, and this also needs to be handled. We want to adapt register automata learning to a more realistic setting, by providing a tool that can interface between the learning algorithm and real programs, such as Java classes.

In summary, we consider the following challenges:

• Defining a new register automaton formalism that is more succinct than previous proposals. Additionally, our register automata should be canonical, and have the ability to compare data values in input for equality and other binary relations, such as inequality (<).

• Defining a new active automata learning framework for our register automata, based on the well-known L∗ algorithm.

• Implementing a tool for inferring register automata models in more realistic settings, such as Java classes. This tool should have added features that adapt our learning framework to handling input and output, as well as some nondeterminism in output (i.e., fresh data values).

1.4 In this thesis

This thesis deals with research challenges within two main areas: finite automata, and automata learning. Chapters 2 and 3 focus on challenges related to automata theory; Chapter 4, on challenges related to automata learning.

In Chapter 2, the challenge is to define a new register automaton model. We first introduce the reader to the classical setting of finite automata and show how they can be used to model different computer programs or systems. We give a description of regular languages and relate them to deterministic finite automata in Section 2.1. Then, we introduce the Nerode congruence and the Myhill-Nerode theorem for regular languages in Section 2.2. We show how the congruence can be used to construct a deterministic finite automaton from a regular language, and discuss why it will work.

In Chapter 3, we introduce the reader to extended finite state machines. We give an example to illustrate the formalism. In particular, we discuss register automata and show how they can be used to recognize languages where data values can be compared using different relations. We describe our own Nerode congruence for register automata in Section 3.2 and also provide a Myhill-Nerode theorem.

In Chapter 4, the challenge is to define a new learning algorithm for register automata. We first give an introduction to active automata learning: Section 4.1 presents the L∗ algorithm for inferring deterministic finite automata. We discuss the particular setting in which L∗ works, and describe the algorithm in terms of a series of interactions between a Learner and a Teacher (both of which are abstract concepts). We relate the algorithm to the Nerode congruence, on which it is based. We go through an example of inferring a small deterministic finite state machine using the L∗ algorithm, and show the different steps that are needed. Section 4.2 presents the SL∗ algorithm, which is an extension of the L∗ algorithm to the setting where inferred models are register automata. We describe this algorithm as well in terms of a series of interactions between a Teacher and a Learner, and discuss how it differs from the L∗ algorithm. We go through an example of inferring a small register automaton model using our algorithm. In Section 4.3, we present RALib, a tool for adapting the SL∗ algorithm for use in more realistic scenarios. We describe the main features of the tool, and how it can be used. We give some examples of models that we can infer using our tool.

Following the background chapters, Chapter 5 summarizes each of the included papers. Chapter 6 then provides an overview of related work before we conclude in Chapter 7.


2. Finite state machines

In this chapter, we introduce the reader to finite state machines and show how they can be used to model computer programs or components. We define the type of language that a finite state machine recognizes, and give a Nerode congruence and a Myhill-Nerode theorem allowing us to construct minimal finite state machines. We then introduce extended finite state machines, in particular register automata, and describe the languages that they recognize. We give a symbolic Nerode congruence and a Myhill-Nerode theorem for register automata. The chapter thus describes the standard theory of finite automata (see also, e.g., [45, 51]).

A finite state machine consumes a sequence of symbols (e.g., letters, numbers, programming commands, ...) and produces some kind of output. The input symbols come from a finite set, called an alphabet. A finite state machine can be drawn as a set of states connected by transitions. The transitions are labeled with symbols from the finite state machine's input alphabet. While processing the symbol that labels a transition, the machine can move from one state to another along the direction of the transition. States can be initial, final, both, or neither. Initial states are those where the finite state machine can begin processing a sequence of input symbols. Final states are those where the finite state machine will stop and accept, if it has finished processing a word.

Example: a coffee maker

We use a coffee maker to demonstrate how a component can be modeled as a finite state machine. We assume that the coffee maker is very simplistic and only has one button. Pushing the button will make the machine brew coffee, provided it has enough coffee beans and water stored to do so. After having brewed one cup, it must be refilled. We can say that the coffee maker has two input symbols in its alphabet, corresponding to the two things a user can do: refill coffee beans and push the button. Moreover, the coffee maker is always in one of two states: STANDBY (waiting to brew coffee) or NEED_REFILL. If the coffee maker is in the STANDBY state, it produces output (coffee) whenever the input is push button, and then moves to the NEED_REFILL state. To get back to the STANDBY state, we must refill beans and water. Figure 2.1 shows a model of the coffee maker. The start arrow corresponds to turning the appliance on, after which it alternates between the two states depending on input, i.e., what the user does. □


Figure 2.1. Small model of a coffee maker.

2.1 Finite automata and regular languages

We will describe two kinds of finite state machines: finite automata, where output is binary (e.g., ACCEPT or REJECT), and Mealy machines, where output is some sequence of symbols. A Mealy machine is a finite state machine with output that depends on the state the machine is in and what input it has processed. On each transition, a Mealy machine processes an input symbol and produces an output symbol. In a finite automaton, on the other hand, output is only produced after the automaton has processed an entire word. All states are marked either accepting or rejecting. Whenever a finite automaton has finished processing a word, it accepts if it is in a final state, and rejects otherwise. In a deterministic finite automaton (DFA), each state has exactly one transition (either to another state or back to itself) for each symbol in the input alphabet. Furthermore, there is only one initial state. Formally, a DFA is defined as follows:

Definition 1 (DFA). A DFA is a tuple 〈Σ, Q, q0, F, T〉, where
• Σ is a finite input alphabet,
• Q is a finite set of states,
• q0 ∈ Q is the initial state,
• F ⊆ Q is the set of final states, and
• T is a subset of Q × Σ × Q. □

Each transition in a DFA is thus of the form 〈q, a, q′〉, where q and q′ are states in Q, and a is an input symbol. The automaton moves from q to q′ on input a iff the transition 〈q, a, q′〉 is in T. Since the DFA is deterministic, for any state q and input symbol a there is exactly one transition 〈q, a, q′〉 ∈ T (assuming that it is completely specified).


Regular languages

A language, in the formal sense, is a set of words over an alphabet. A regular language is a language that can be defined using a regular expression. Regular expressions describe patterns over words; for example, the regular expression abb* defines the set of words that start with one a and then continue with at least one b. Thus abb* defines a regular language. The words ab and abb, for example, are words in this language, whereas the words bb and aa, for example, are not.
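As a concrete aside (not from the thesis), membership in the language defined by abb* can be tested directly with a standard regular-expression library; the function name is an illustrative choice.

    import re

    def in_abbstar(word: str) -> bool:
        # fullmatch succeeds only if the entire word matches the pattern abb*
        return re.fullmatch(r"abb*", word) is not None

    assert in_abbstar("ab") and in_abbstar("abbb")
    assert not in_abbstar("a") and not in_abbstar("bb") and not in_abbstar("aa")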

A language is regular if and only if it is recognized by a finite automaton. For this reason, we often call finite automata acceptors of regular languages. An acceptor that recognizes a language will accept all words in the language and reject all other words.

Figure 2.2. A DFA over the alphabet {a,b} that accepts ab followed by zero or more b.

Example: a deterministic finite automaton (DFA)

Figure 2.2 shows a deterministic finite automaton with four states, numbered q0 through q3. State q0 is initial (indicated by an arrow). State q2 is accepting (indicated by a double circle); all other states are rejecting. State q3 is a dead end (or a sink state), since there is no way to reach the accepting state from it. For this reason, we draw transitions leading to the sink state as dotted lines.

The machine accepts words that start with the sequence ab and continue with zero or more b symbols, i.e., the regular language defined by the regular expression abb*. In the remainder of this chapter, we will refer to this language as Labb*. For example, the word abbb is accepted: the machine starts in q0, moves to q1 by processing a, then moves to q2 by processing b, and stays there while processing the last two b symbols. The sequence aa is rejected: the machine starts in q0, moves to q1 by processing the first a, but then moves to the sink state q3 by processing the second a. □
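The same DFA can be encoded as a small transition table and executed. The sketch below is illustrative only (not part of the thesis), with q3 acting as the sink state:

    # States: q0 (initial), q1, q2 (accepting), q3 (sink).
    TRANSITIONS = {
        ("q0", "a"): "q1", ("q0", "b"): "q3",
        ("q1", "a"): "q3", ("q1", "b"): "q2",
        ("q2", "a"): "q3", ("q2", "b"): "q2",
        ("q3", "a"): "q3", ("q3", "b"): "q3",
    }
    ACCEPTING = {"q2"}

    def dfa_accepts(word: str) -> bool:
        state = "q0"
        for symbol in word:
            state = TRANSITIONS[(state, symbol)]
        return state in ACCEPTING

    assert dfa_accepts("abbb")       # the accepted example from the text
    assert not dfa_accepts("aa")     # rejected via the sink state q3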


2.2 Constructing a DFA from a regular language

For each regular language, there is one particular DFA that is minimal, i.e., it has just enough states to recognize the language. In some sense, the minimal DFA is the most ’efficient’ among all DFAs that recognize the regular language, since no other such DFA will have fewer states. The Myhill-Nerode theorem [60, 61], due independently to Anil Nerode and James Myhill in the 1950s, provides a way to construct minimal DFAs directly from a regular language. Let us see how this can be done.

Recall that a DFA must, by definition, have a finite set of states. Each state is either accepting or rejecting, denoting whether all words that lead to it are in, or not in, the language that the DFA recognizes. In order to construct a DFA that recognizes a regular language, we need to determine a set of states that is both necessary and sufficient in order to recognize the regular language. If we use too many states (for example, one state for each possible word over the alphabet), then the DFA will be unnecessarily large or even infinite (in which case it would violate the definition of a DFA). If, on the other hand, we use too few states, then the DFA will not be able to recognize the language correctly.

To obtain a finite, minimal number of states from a regular language, we introduce an equivalence relation on words. An equivalence relation partitions a set of elements into a number of equivalence classes. We will define a particular equivalence relation on words, called the Nerode congruence.

Definition 2 (Nerode congruence). Let L be a language over the alphabet Σ, and let u and v be words from Σ∗. Then u and v are Nerode-equivalent, denoted u ≡L v, if for any word w ∈ Σ∗ we have that uw is in L if and only if vw is in L. □

According to the above definition, any language L induces an equivalence relation on the set of words. Two words are equivalent with respect to L if and only if, when extended by any same suffix, the extended words are either both in the language or both not in the language. Thus, if two words are equivalent according to this definition, it means that they will both always exhibit the same behavior when extended by the same suffix. Conversely, if two words are not equivalent, there is a distinguishing suffix that, when appended to both sequences, will result in one of the extended sequences being in the language and the other not being in the language. For example, recall the language Labb*. We have that ab ≡Labb* abb: both words are in the language, and regardless of what suffix we extend them with, they will always behave the same. On the other hand, we have that a ≢Labb* b: they can be distinguished by, e.g., the suffix b, since ab is a word in the language, whereas bb is not.
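The idea of a distinguishing suffix can be explored with a small sketch. The check below is only a bounded approximation of the congruence (it tries all suffixes up to a fixed length) and is not the construction used in this thesis; in_labbstar and nerode_equivalent are illustrative names.

    import re
    from itertools import product

    def in_labbstar(word: str) -> bool:
        return re.fullmatch(r"abb*", word) is not None

    def nerode_equivalent(u, v, in_language, alphabet="ab", max_len=3):
        """Approximate the Nerode equivalence of u and v by testing all suffixes up to max_len."""
        for n in range(max_len + 1):
            for suffix in map("".join, product(alphabet, repeat=n)):
                if in_language(u + suffix) != in_language(v + suffix):
                    return False, suffix        # found a distinguishing suffix
        return True, None

    assert nerode_equivalent("ab", "abb", in_labbstar) == (True, None)
    assert nerode_equivalent("a", "b", in_labbstar) == (False, "b")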

For each equivalence class of the Nerode congruence, there is a residual language. The residual language of a language L with respect to a word u, denoted u⁻¹L, is the set of all words v such that uv is in L. For example, in the language Labb*, the word a has the residual language a⁻¹Labb* = {bb*}, and the word b has the residual language b⁻¹Labb* = {}, since no word that starts with b is in Labb*. We observe that if two words u and w have equal residual languages, i.e., if u⁻¹L = w⁻¹L, then there can be no distinguishing suffix between them, i.e., they must be Nerode-equivalent with respect to L. Thus u⁻¹L = w⁻¹L implies u ≡L w.

In a minimal DFA, two words with identical residual languages will always lead to the same state, since there is no distinguishing suffix between them, i.e., they will always behave the same when extended by the same suffix. Moreover, two words u and w whose residual languages differ will always lead to distinct states. In this case, there is at least one suffix v for which uv ∈ L but wv ∉ L (or vice versa). The suffix v is then a distinguishing suffix for u and w with respect to L, so the two words cannot lead to the same state.

Provided that the Nerode congruence has finitely many equivalence classes, we can use the Myhill-Nerode theorem to build a DFA where each state directly corresponds to a Nerode equivalence class. The DFA is guaranteed by the theorem to be minimal. Conversely, if the Nerode congruence does not have finitely many classes, then it is not possible to build a DFA that recognizes the language. This, in turn, means that the language is not regular.

Theorem 1 (Myhill-Nerode). Let L be a language over an alphabet Σ, and let ≡L be the induced Nerode congruence on Σ∗ with respect to L. Then L is regular if and only if the number of equivalence classes of ≡L is finite. The minimal DFA that recognizes L has exactly as many states as the number of equivalence classes of ≡L. □

For example, consider again the language Labb*. The equivalence classes of ≡Labb* are four, each defined by a regular expression:

• ε,
• a,
• abb* (the equivalence class of all words that begin with a and continue with at least one b), and
• (aa* | ba* | bb*) (the equivalence class of all words that begin with aa or with b).

It can easily be seen that each of the four classes corresponds to a state inthe DFA in Figure 2.2.
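Assuming the in_labbstar and nerode_equivalent helpers from the sketch in Section 2.2 above, the construction can be illustrated by picking one representative word per class and reading off the transition from class [u] on symbol s as the class of u·s; the resulting table matches the DFA in Figure 2.2.

    # One representative word per Nerode class of Labb*:
    REPS = ["", "a", "ab", "b"]     # classes [eps], [a], [abb*] and the sink class

    def class_of(word):
        for rep in REPS:
            if nerode_equivalent(word, rep, in_labbstar)[0]:
                return rep
        raise ValueError("word fits none of the expected classes")

    # States = classes; transitions follow by appending one symbol to a representative.
    for rep in REPS:
        for symbol in "ab":
            print(f"state [{rep or 'eps'}] --{symbol}--> state [{class_of(rep + symbol) or 'eps'}]")
    print("accepting states:", [rep for rep in REPS if in_labbstar(rep)])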

The Myhill-Nerode theorem can be used to show that languages are regular (or not), but more importantly for this thesis, it can be used as a basis for automata learning algorithms (Chapter 4).


3. Extended finite state machines

More complex components cannot always be adequately and succinctly represented by finite state machines: if a component's input alphabet is very large, a finite state machine model of it will become infeasibly large, and if the input alphabet is infinite, the component cannot be represented as a finite state machine at all.

To address this problem, we can instead use extended finite state machines. Extended finite state machines focus on relations between input symbols rather than on their concrete values. They are finite state machines extended with memory: they have assignments, and guards over input symbols.

Example: a coffee maker with memory

Suppose we have another coffee maker, more advanced than our previous one. The new machine can brew several cups of coffee at a time, so in addition to the one button for making coffee, it also has a keypad where users can enter the desired number of cups. The advanced coffee maker stores coffee beans and water for multiple cups, and when processing user input, it takes into consideration how much coffee beans and water it is currently storing. If the number of cups input by the user is more than the current amount of water and coffee beans will allow, the machine will ask for a refill; otherwise, it will brew coffee. We assume that the coffee maker can store a maximum of ten cups' worth of water and coffee beans.

The advanced coffee maker cannot be modeled as an FSM, since there are infinitely many combinations of digits that can be entered via the keypad, and each of these combinations would need to be represented as a separate transition. To model this coffee maker using a finite number of states requires a more advanced formalism that has memory in addition to states and transitions.

Figure 3.1 shows a model of the advanced coffee maker. It is also always in one of two states: STANDBY and NEED_REFILL. It stays in the STANDBY state while brewing coffee, until its store of water and coffee beans has been depleted; then, it moves to the NEED_REFILL state. To get back to the STANDBY state, we must refill it with some amount of beans and water.

The start arrow corresponds to turning the coffee maker on, and setting the initial quantity of coffee beans and water. Following this, the coffee maker alternates between the two states depending on what we do to it. In order for the appliance to function properly, it must always remember how much water and coffee beans it has stored, and be able to update this information as coffee is brewed and water and coffee beans are refilled. The coffee maker's memory is represented as a gray square in the figure, and it is updated by every transition.

Figure 3.1. Small model of a coffee maker with memory.
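To make guards and memory concrete, the model in Figure 3.1 can be sketched as follows (an illustration only: the class and method names are invented for the example, and the capacity of ten units comes from the assumption above):

    class CoffeeMaker:
        CAPACITY = 10                      # assumed maximum number of stored coffee units

        def __init__(self, initial_units):
            self.units = min(initial_units, self.CAPACITY)   # memory assignment: set initial value
            self.state = "STANDBY"

        def push_button(self, n):
            # Guard: at least n units available -> brew, decrease memory, stay in STANDBY.
            if self.state == "STANDBY" and n <= self.units:
                self.units -= n
                return f"brewing {n} cups"
            # Guard: fewer than n units available -> move to NEED_REFILL.
            self.state = "NEED_REFILL"
            return "please refill"

        def refill(self, n):
            # Memory assignment: increase by n (capped at the capacity), back to STANDBY.
            self.units = min(self.units + n, self.CAPACITY)
            self.state = "STANDBY"

    maker = CoffeeMaker(initial_units=3)
    assert maker.push_button(2) == "brewing 2 cups"
    assert maker.push_button(2) == "please refill"
    maker.refill(5)
    assert maker.push_button(4) == "brewing 4 cups"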

3.1 Data languages and register automata

In this thesis, we deal with a particular kind of extended finite state machine, called register automata. Before we discuss register automata, let us introduce a type of formal language that can represent relations between alphabet symbols.

A data symbol is of the form a(d), where a is an action from a finite set A of actions, and d is a data value from a possibly unbounded domain D of data values. Here, we assume that a data symbol only has one data value, but our results extend to the case where data symbols carry more than one data value. We use the term data word to denote a sequence of data symbols.

In order to represent relations between data values in a data word, we introduce a theory. A theory is a pair 〈D, R〉, i.e., a set R of relations on a data domain D. The theory specifies what types of data values (e.g., integers or real numbers), and what relations between data values (e.g., equality or inequality), are used to determine membership in a particular data language. It also means that whenever two data words have the same relations between data values, they will either both be in the data language or both not in the data language.


For example, if the set of relations in a theory is empty, then membership in a data language is only based on the actions in data symbols, similar to a DFA. On the other hand, if the set of relations contains only equality, then membership is based on whether data values are equal or not. We formalize this as follows:

Definition 3 (R-indistinguishable data words). Two data words w = α1(d1) . . . αn(dn) and w′ = α′1(d′1) . . . α′n(d′n) are R-indistinguishable if
• α1 . . . αn = α′1 . . . α′n, i.e., they have the same sequences of actions, and
• R(di, . . . , dj) if and only if R(d′i, . . . , d′j), whenever R ∈ R and i, . . . , j are indices between 1 and n. □

For example, by using the theory of equality over the integers, we can distinguish between the words a(1)a(4)b(1) and a(3)a(3)b(3), where in the one case only the first and third data values are equal, but in the other case all data values are equal. However, the words a(1)a(4)b(1) and a(5)a(2)b(5) are indistinguishable with respect to the equality relation since, in both cases, only the first and third data values are equal. If instead we use the theory of equality and inequality (<) over the integers, we can also distinguish between the words a(1)a(4)b(1) and a(5)a(2)b(5): in the one case, the second data value is bigger than the first and the third data values, but in the other case, the second data value is smaller than the first and the third data values.
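A small sketch of the comparison in Definition 3, restricted to binary relations (the function and relation names are illustrative, and this is not the thesis's formalization): a data word is a list of (action, value) pairs, a theory is a list of relations, and two words are compared by evaluating every relation on every pair of positions.

    import operator
    from itertools import permutations

    def indistinguishable(w1, w2, relations):
        """Definition 3, specialized to binary relations R(d_i, d_j)."""
        if [a for a, _ in w1] != [a for a, _ in w2]:
            return False                       # different action sequences
        vals1 = [d for _, d in w1]
        vals2 = [d for _, d in w2]
        for rel in relations:
            for i, j in permutations(range(len(vals1)), 2):
                if rel(vals1[i], vals1[j]) != rel(vals2[i], vals2[j]):
                    return False               # some relation holds in one word only
        return True

    w1 = [("a", 1), ("a", 4), ("b", 1)]
    w2 = [("a", 5), ("a", 2), ("b", 5)]
    assert indistinguishable(w1, w2, [operator.eq])                    # equality alone cannot tell them apart
    assert not indistinguishable(w1, w2, [operator.eq, operator.lt])   # adding < distinguishes them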

A data language is a set of words that respects the set R of relations, in the sense that whenever two data words are R-indistinguishable, they are either both in the data language or both not in the data language. If a data language respects the set R of relations on a data domain D, we say that it is parameterized on the theory 〈D, R〉.

We are now ready to introduce register automata. We will use them as acceptors for data languages and, in this thesis, limit ourselves to deterministic register automata. We refer to the languages that are recognizable by deterministic register automata as regular data languages. Formally, a deterministic register automaton can be defined as follows:

Definition 4 (Register automaton). A register automaton is a tuple 〈L, l0, F, X, Γ〉, where
• L is a finite set of locations,
• l0 is the initial location,
• F ⊆ L is the set of final (accepting) locations,
• each location l has a finite set X(l) of registers, and
• Γ is a finite set of transitions, each of form 〈l, α(p), g, π, l′〉, where
  – l, l′ ∈ L are the source and target locations, respectively,
  – α(p) is a parameterized input symbol with the parameter p,
  – g is a guard over p and X(l), and
  – π is the assignment, a mapping from X(l′) to X(l) ∪ {p}. Intuitively, the register x ∈ X(l′) is assigned the value of π(x), which is either the parameter p or some register x′ ∈ X(l). □

Example: a register automaton

A register automaton can be represented as a set of locations with transitions between them. Locations have registers that can store data values. Figure 3.2 shows a register automaton with four locations, numbered l0 through l3. l0 is the initial location (indicated by an arrow); all locations except l3 are final (indicated by a double circle). Location l3 is a sink state. All locations except the initial location have a register x.

Transitions are labeled with parameterized symbols and guards over parameters and registers. Transitions that lead to the sink location are denoted by dotted lines. Parameters are placeholders for data values, so parameterized input symbols represent data symbols. Guards represent conditions on data values. For example, the guard on the transition that loops in l1 states that the data value occurring in an a-symbol must be at least as big as the data value stored in the register x.

The register automaton processes data symbols with actions from the set A = {a} and data values from the domain of natural numbers. Thus, it accepts words of form a(d1) . . . a(dn) under certain conditions. The data language L consists of words of form a(d1) . . . a(dn), such that whenever di > di+1 for some i ≤ n−2, then di+1 ≤ di+2.

When the register automaton processes a data symbol, it matches the data symbol with a parameterized input symbol, checks which transition guard is satisfied by the symbol's data value, and moves along that transition. Let us now see how the register automaton processes a data word.

On the first transition from l0 to l1, the data value occurring in an a-symbol is stored in x. The register is then updated on transitions to and from location l2. The loop transition in location l1 is taken whenever the data value occurring in an a-symbol is at least as big as the one currently stored in x. For example, the word a(1)a(4)a(0)a(7) is accepted: the automaton starts in l0, where it stores 1 in the register x. Then it loops in location l1 and stores 4, after which it moves to l2 by comparing 0 to the value stored in x (4). It also updates x to 0. Finally, the automaton moves back to l1, since 7 is bigger than the currently stored value 0.

We use Lra to denote the regular data language recognized by this register automaton. □
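The run just described can be reproduced by simulating the automaton directly. The sketch below is illustrative (it is not the formal semantics used in the papers); it encodes the single register x, the guards x ≤ p and p < x, and the assignment x := p as ordinary code:

    def accepts_lra(values):
        """Run the register automaton of Figure 3.2 on a(values[0]) ... a(values[n-1])."""
        location, x = "l0", None
        for p in values:
            if location == "l0":            # first symbol: x := p, move to l1
                location, x = "l1", p
            elif location == "l1":
                if x <= p:                  # guard x <= p: loop in l1, x := p
                    x = p
                else:                       # guard p < x: move to l2, x := p
                    location, x = "l2", p
            elif location == "l2":
                if x <= p:                  # guard x <= p: back to l1, x := p
                    location, x = "l1", p
                else:                       # guard p < x: move to the sink l3
                    location = "l3"
            else:
                break                       # l3 is a sink: no way back
        return location != "l3"             # all locations except l3 are accepting

    assert accepts_lra([1, 4, 0, 7])        # the accepted word a(1)a(4)a(0)a(7) from the text
    assert not accepts_lra([5, 2, 1])       # two consecutive decreases end in the sink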

The model in Figure 3.2 corresponds to the register automata presented in Paper III. In Paper I, we used slightly different terminology¹ to present a register automaton model for the theory of equality over different data domains.

¹ Paper I does not use the notion of a theory (since only equality is considered) and describes data words in terms of constraints on their data values.


Figure 3.2. A register automaton that accepts words of the form a(d1)a(d2) . . . under certain conditions.

There are other similar register automata models (e.g., [15, 49]) that also restrict possible relations between data values to only equality, and otherwise differ slightly, e.g., in whether registers are initially empty or not, and in what values are stored in registers during a run of the automaton.

3.2 Constructing a register automaton from a data language

Recall that a register automaton accepts or rejects words in a data language based on relations between data values. In some sense, we can say that it processes words at a symbolic level. In Section 2.2, we described a way to construct a finite automaton from a regular language by means of the Myhill-Nerode theorem. Now, we would like to use a similar approach to construct register automata from a data language. More precisely, we want to define an equivalence relation that classifies prefixes based on whether the result of concatenating them with a particular suffix is in the data language or not. This will allow us to construct locations in a register automaton, but we also need to be able to determine which transition guards to use, and when and how to store data values in registers. Unfortunately, the Nerode congruence in Definition 2 will not work directly; we need to adapt it to the register automata setting, as the following example shows.

Consider the data language Lra accepted by the register automaton in Figure 3.2. We choose two sequences u and v of data symbols that lead to the same location in the register automaton, and see if we can find a distinguishing suffix. Since the Nerode congruence is used to construct minimal automata, we should not be able to find such a distinguishing suffix if the congruence is directly adaptable to the register automaton setting.



For example, let u = a(5)a(2) and v = a(8)a(4). According to Figure 3.2, both these sequences lead to location l2, with x storing 2 and 4, respectively. However, we can easily find a suffix w that distinguishes them, for example, w = a(3). In the automaton, uw = a(5)a(2)a(3) leads to location l1 whereas vw = a(8)a(4)a(3) leads to the sink location. In fact, by applying an equivalence relation similar to the Nerode congruence for finite automata, almost all words could become inequivalent. The reason for this is that by comparing two sequences of symbols with respect to an individual suffix, we cannot capture relations between data values: in the one case, 3 is smaller than the value currently stored in x; in the other case, 3 is bigger. Since register automata accept or reject words based on relations between data values, we need to use a different notion of 'suffix' for comparing two words, and amend the Nerode congruence accordingly.

Symbolic decision trees

We will define a symbolic version of a suffix, essentially consisting of several suffixes together, represented as different relations between data values. We call these symbolic decision trees. Symbolic decision trees are used instead of suffixes in order to adapt Definition 2 to data languages.

In the setting of regular languages, the Nerode congruence defines two words as equivalent if there is no suffix that can distinguish them. In this setting, as we have seen, one suffix is not sufficient to distinguish two data words. Thus, we can instead consider a parameterized suffix (e.g., of form α1(p1) . . . αn(pn)) and consider how relations between the parameters p1, . . . , pn and the data values in the prefix affect whether the resulting data word is in the data language or not. We often refer to the sequence of actions in a suffix (concrete or parameterized) as a symbolic suffix.

A symbolic decision tree is essentially a restricted form of a register automaton in the shape of a tree. It has locations with registers, and transitions with guards and assignments. A (u,V)-tree is a symbolic decision tree that models the acceptance of a specific set of suffixes after a given prefix u. Formally, we define a (u,V)-tree as follows:

Definition 5 ((u,V)-tree). Let u be a data word and let V be a set of symbolic suffixes. A (u,V)-tree is a symbolic decision tree T which, for any suffix v whose sequence of actions is in V, either accepts uv or rejects uv. �

Figure 3.3 shows a (u,V)-tree with two locations: a root location at the top and a leaf location at the bottom. Both locations are accepting (denoted by double circles). The tree starts processing input in the root location, again similar to a register automaton. It accepts suffixes of form a(d), regardless of the value of d, e.g., a(2) or a(1).



Figure 3.3. A symbolic decision tree that accepts a(d) for any d. [Its two locations are connected by a single transition labeled a(p).]

The tree in Figure 3.4(a) is slightly more complex, with three locations and one register in the root location, x2. It accepts suffixes of form a(d) whenever d is at least as big as the value stored in x2, and otherwise rejects.

Whenever a (u,V)-tree has registers in the root location, these registers store values from the prefix u. This enables the tree to model relations between data values in u and data parameters in the set of suffixes, by means of guards and assignments. We introduce a technical restriction on registers, namely that a register xi may only store the ith data value in a word (either from the prefix or suffix). This makes it easier to compare (u,V)-trees to each other when the prefix u is not the same for both trees.

We use νu to denote the valuation of registers that a particular prefix u induces. For example, a (u,V)-tree for u = a(8)a(4) and V = {a(d)} will look exactly like the tree in Figure 3.4(a). In this case, νu specifies that the value of x2 is 4 (since x2 stores the second data value from u).

Two (u,V)-trees T and T′ for the same set V of symbolic suffixes are isomorphic, denoted T ≡ T′, if their locations, transitions, and registers are the same. For a bijection γ between the registers of T and the registers of T′, the trees T and T′ are isomorphic under γ, denoted T ≡γ T′, if they are isomorphic when renaming all registers according to γ. For example, the trees in Figure 3.4(a) and 3.4(b) are isomorphic under γ whenever γ(x2) = x3.

(a) A symbolic decision tree that accepts a(d) whenever d is at least as big as the value stored in the register x2. [Root register x2; initial transitions a(p) | p < x2 and a(p) | x2 ≤ p.]

(b) A symbolic decision tree that accepts a(d) whenever d is at least as big as the value stored in the register x3. [Root register x3; initial transitions a(p) | p < x3 and a(p) | x3 ≤ p.]

Figure 3.4. Symbolic decision trees ((u,V)-trees) for V = {a} and different u.

Symbolic decision trees can be refined by adding more symbolic suffixes to V. For example, the (u,V)-tree in Figure 3.5 refines the tree in Figure 3.3: A register x1 has been added, since in the refined tree, guards refer to data values in the prefix. The tree has also increased in length, since in the refined tree we consider suffixes of length 2. The initial branch from the root location has been split in two.



[Figure 3.5: root register x1; initial transitions a(p) | p < x1, x2 := p and a(p) | x1 ≤ p; below the p < x1 branch, the second-level transitions are a(p) | p < x2 and a(p) | x2 ≤ p; below the x1 ≤ p branch, the second-level transition is a(p).]

Figure 3.5. A symbolic decision tree that accepts or rejects a(d)a(d) under certain conditions.

The left branch represents the case when the data value in a(d1) is smaller than the one stored in x1 and the right branch represents the case when the data value in a(d1) is at least as big as the stored value. The refined tree as a whole distinguishes three different cases, represented by the three leaf nodes:

• The middle leaf represents the case where the prefix a(2) is followed by consecutively smaller d1 and d2. For example, the word a(2)a(1)a(0) would reach this leaf.

• The left leaf represents the case where the prefix a(2) is followed by a smaller d1 and then a bigger d2. For example, the word a(2)a(0)a(1) would reach this leaf.

• The right leaf represents the case where the prefix a(2) is followed by a bigger or equal d1 and then any d2. For example, the word a(2)a(3)a(2) would reach this leaf.

Tree oracles

We already have the notion of a symbolic decision tree for a set of symbolic suffixes after a given prefix. Next, we need to ensure that the tree conforms to a particular data language, i.e., that whenever a word uv is in the data language, a (u,V)-tree will accept uv if v is in V. We use a function called a tree oracle to do this. A tree oracle is formally defined as follows:

Definition 6 (Tree oracle). Let 〈D,R〉 be a theory. A tree oracle for 〈D,R〉 is a function O that, given a data language L, a data word u, and a set V of symbolic suffixes, returns a (u,V)-tree OL(u,V). This tree has the property that for any data word uv whose suffix v has its sequence of actions in V, v is accepted by OL(u,V) under νu if and only if uv ∈ L, and rejected by OL(u,V) under νu if and only if uv ∉ L. �

The tree oracle generates a (u,V)-tree that conforms to a particular language L. However, Definition 6 in principle does not specify exactly what a tree should look like for a given u, V, and L. This can be problematic when we want to use the trees to construct a canonical register automaton. In particular,



we want trees to always look the same when constructed for the same u, V, and L, and we want to impose some restrictions on which guards and assignments are allowed. One prerequisite for this is the existence of a representative data value for each guard in a tree, which allows us to generate exactly one concrete data word for each sequence of guarded transitions in a (u,V)-tree. We define a representative data value as follows:

Definition 7 (Representative data value). For a guard g and a prefix u, a representative data value d^g_u is a data value with the following properties:

• it satisfies g under the valuation νu of registers induced by u, and
• whenever g is refined to become h, then d^h_u = d^g_u provided that d^g_u satisfies h under νu. �

We are now ready to define canonical tree oracles.

Definition 8 (Canonical tree oracle). The tree oracle OL is canonical if it satisfies the following conditions for any language L, prefix u, and set V of symbolic suffixes:

• Let OL(u,V)[l] denote the subtree of OL(u,V) starting in location l. For each initial transition 〈l0, α(p), g, π, l〉 in OL(u,V), we have that OL(u,V)[l] ≡ OL(u α(d^g_u), W), where W contains all symbolic suffixes in V without the first action α.

• Whenever V ⊆ V′, then
  – for any u, u′, and γ, if OL(u,V′) ≡γ OL(u′,V′) then OL(u,V) ≡γ OL(u′,V),
  – for each initial α-guard g in OL(u,V) there is an initial α-guard h in OL(u,V′) with νu |= h[d^g_u/p] and νu |= h −→ g, and
  – the set of registers in OL(u,V) is a subset of the set of registers in OL(u,V′). �

The definition states the properties that are necessary for a tree oracle to be canonical. First, the oracle constructs trees recursively. Second, by extending the set of symbolic suffixes we cannot make two previously inequivalent symbolic decision trees equivalent. Third, whenever we extend the set of symbolic suffixes, the guards of the initial transitions either stay the same or are refined; the set of registers is either the same or extended.

To give an idea of how symbolic decision trees can be generated by a tree oracle, we present the following example.

Example: how to generate symbolic decision trees

Consider the theory of equality and inequality over natural numbers, and the two words u = a(5)a(2) and v = a(8)a(4). With respect to the set R = {=, <} of



relations, the words u and v are R-indistinguishable. We can compare them at a symbolic level, i.e., focusing on the relations between data values, by doing the following:

• We extend each word with a set of carefully chosen suffixes. By 'carefully chosen', we mean one suffix from each class of R-indistinguishable suffixes, i.e., suffixes that represent the relations in the theory on which the particular data language is parameterized. For the language L, the set R of relations is {=, <}. Thus, there are five possible suffixes with which each word will need to be extended: after a(8)a(4), for example, these can be a(3), a(4), a(6), a(8), and a(9). Each of these suffixes represents a unique relation between the data value in the suffix and the data values in the prefix. For example, a(6) represents the case where the data value in the suffix is smaller than the first data value in a(8)a(4), but bigger than the second one. (In this case, we might of course just as well have chosen a(5), since a(8)a(4)a(5) is R-indistinguishable from a(8)a(4)a(6).)

• We group the suffixes according to whether the corresponding extended words are in the data language or not. For example, after a(8)a(4), the suffix a(3) is rejected, whereas a(4), a(6), a(8) and a(9) are all accepted.

• We organize the groups of suffixes in a tree structure, and label each branch with the appropriate relation(s) between data parameters. Registers in the root location store data values from the original word. For example, after the word a(8)a(4), we can use a register x2 to denote the stored value 4. The accepted suffixes a(4), a(6), a(8) and a(9) after a(8)a(4) can be collectively represented as x2 ≤ p, where p represents the data value occurring in the symbol a(d). The remaining rejected suffix can be represented as p < x2. We leave it as an exercise to the reader to verify that the same holds for the set of five suffixes after a(5)a(2), i.e., that they can be represented using the same tree, modulo the value stored in x2. �

When determining the guards with which to label branches in a tree, we use relations that are as weak as possible. This has the effect that we will only store data values from the prefix in registers if they are necessary for determining acceptance or rejection of a suffix. In this case, for example, we do not need any relation between the first data value in the prefix and data parameters in the suffix, since it does not matter for acceptance or rejection of the suffixes.
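The procedure sketched in this example can be mimicked in a few lines of Java (our own sketch, not from the papers): for one-symbol suffixes in the theory of equality and inequality over the naturals, enumerate one representative value per class of R-indistinguishable suffixes after a prefix and record the membership answers. The class name and the hard-coded oracle for the language of Figure 3.2 are assumptions made for the illustration.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeSet;
    import java.util.function.Predicate;

    public class SuffixClassification {
        /** One representative value per class of R-indistinguishable one-symbol suffixes after the prefix. */
        static List<Integer> representatives(List<Integer> prefixValues) {
            TreeSet<Integer> vals = new TreeSet<>(prefixValues);
            List<Integer> reps = new ArrayList<>();
            reps.add(vals.first() - 1);                      // smaller than every prefix value
            Integer prev = null;
            for (int v : vals) {
                if (prev != null) {
                    int mid = (prev + v) / 2;                // strictly between two adjacent prefix values
                    if (mid > prev && mid < v) reps.add(mid);
                }
                reps.add(v);                                 // equal to a prefix value
                prev = v;
            }
            reps.add(vals.last() + 1);                       // bigger than every prefix value
            return reps;
        }

        public static void main(String[] args) {
            List<Integer> prefix = List.of(8, 4);
            // Membership oracle for the language of Figure 3.2, hard-coded for this prefix:
            // after one decrease (8 then 4), a suffix value d is accepted iff 4 <= d.
            Predicate<Integer> accepted = d -> 4 <= d;
            for (int d : representatives(prefix))            // enumerates 3, 4, 6, 8, 9 as in the text
                System.out.println("a(8)a(4)a(" + d + ") -> " + (accepted.test(d) ? "+" : "-"));
            // Grouping the answers yields the guards p < x2 (rejected) and x2 <= p (accepted) of Figure 3.4(a).
        }
    }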

We are now ready to define a symbolic version of the Nerode congruence using symbolic decision trees 2.

2 This is a simplified version of the congruence defined in Paper III, which takes into account particular properties of theories.



Definition 9 (Symbolic Nerode congruence). Let O be a canonical tree oracle for a theory 〈D,R〉. Let u and u′ be data words. For a language L, the words u and u′ are Nerode-equivalent if for any set V of symbolic suffixes, there is a bijection ρ between the registers in OL(u,V) and OL(u′,V) such that OL(u,V) ≡ρ OL(u′,V). �

The symbolic Nerode congruence states that two data words are equivalent with respect to a data language if and only if their symbolic decision trees are isomorphic under some bijection between registers when constructed from the same set of symbolic suffixes. In other words, if two data words u and u′

are equivalent according to this definition, it means that whenever a word v is accepted (rejected) by the symbolic decision tree after u, then any word v′ that is R-indistinguishable from v is accepted (rejected) by the symbolic decision tree after u′ (and vice versa).

Conversely, if two data words are not equivalent, there is a set V of symbolic suffixes for which the constructed symbolic decision trees cannot be made isomorphic under any renaming of registers. Much in the same way as the original Nerode congruence, the symbolic version can thus be used to separate locations when building a register automaton. Provided that the symbolic Nerode congruence has finitely many equivalence classes, we can use a symbolic version of the Myhill-Nerode theorem to build a register automaton where each location directly corresponds to (at least) one symbolic Nerode-equivalence class. This register automaton will not necessarily be minimal, but it will be canonical. Conversely, if the symbolic Nerode congruence does not have finitely many equivalence classes, then we cannot build a register automaton that recognizes the data language.

Theorem 2 (Symbolic Myhill-Nerode). Let L be a data language, and ≡L a symbolic Nerode congruence on L. If L is regular 3, then we can build a register automaton that recognizes L. This automaton will have at most as many locations as the number of equivalence classes of ≡L. �

We use the symbolic Myhill-Nerode theorem as a basis for the symbolic automata learning algorithm presented in Section 4.2. A proof for this theorem can be found in Paper III.

3 This is with respect to a particular tree oracle, not necessarily in general. See Paper III for details.



4. Algorithms for automata learning

Finite automata models of (regular) languages can be inferred using a form of machine learning called concept learning. A concept in this sense is a set of objects from some domain. In concept learning, or learning from examples, the task is to infer a model that represents the unknown concept. In order to do this, we need a set of examples from the domain. This set is called training data. Each example in the training data has a label providing information about whether it belongs to the concept we are learning or not.

There are two main approaches to concept learning, differing in the principles for labeling training data. In passive learning, all training data has been labeled beforehand and no additional training data is available. In active learning, some training data may be labeled initially, but not all of it. Instead, training data is labeled when the particular learning algorithm chooses to do so. In this thesis, we focus on active concept learning.

Active learning in general, and active concept learning in particular, is often formulated as an interaction between a Learner and a Teacher. The Teacher has knowledge about the unknown concept. The Learner can make queries to the Teacher about the concept, and the Teacher answers these correctly. The Learner uses this information to construct a representation of the concept.

In a minimally adequate teacher (MAT) setting [12], the Teacher can answer two kinds of queries: membership queries and equivalence queries. A membership query consists in asking for the label of a particular example from the domain. An equivalence query is made when the Learner has constructed a hypothesis of the concept, and consists in asking whether this hypothesis is the correct representation of the concept or not. If not, the Teacher supplies a counterexample: an example that is not part of the concept but in the Learner's hypothesis, or vice versa. The efficiency of algorithms that work in the MAT setting is typically measured in terms of how many membership queries and equivalence queries are needed in order to infer a representation of a particular concept.
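As a purely illustrative rendering of the MAT setting (not an API of any existing learning library), the two kinds of queries can be captured by a small Java interface; the type names are our own.

    import java.util.List;
    import java.util.Optional;

    /** The Learner's view of a minimally adequate Teacher (sketch). */
    interface Teacher<A> {
        /** Membership query: is the word part of the unknown concept? */
        boolean membershipQuery(List<A> word);

        /** Equivalence query: empty if the hypothesis is correct, otherwise a counterexample. */
        Optional<List<A>> equivalenceQuery(Hypothesis<A> hypothesis);
    }

    /** Whatever representation the Learner builds, e.g. a DFA (sketch). */
    interface Hypothesis<A> {
        boolean accepts(List<A> word);
    }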

4.1 The L∗ algorithm: inferring finite automata models

Active concept learning of regular languages, represented as finite automata, is often just called active automata learning. In this setting, the domain is Σ∗, the set of words over a finite alphabet Σ. The concept is a regular language over Σ. The most commonly used algorithm for active automata learning is L∗ [12],



developed by Dana Angluin in the 1980s. Since then, active automata learning has been used in a number of different applications (for an overview, see [36]), and a number of different variations have been developed, for example the optimization by Rivest and Schapire [64]. In this exposition, we will describe their optimization.

The L∗ algorithm infers a minimal deterministic finite automaton that recognizes a particular regular language over a finite alphabet. We refer to this language as the target language. In the L∗ algorithm, a membership query consists in asking whether a word w is in the target language or not. The answer is either +, if it is, or −, otherwise. An equivalence query consists in asking whether a hypothesis DFA constructed by the Learner is correct or not. The answer is either yes, or a counterexample, which is a word on which the hypothesis DFA and the Teacher disagree, i.e., a word that is either in the target language but rejected by the hypothesis DFA, or vice versa.

Conceptually, L∗ is based on the Nerode congruence (Section 2.2), and on partition refinement. The Nerode congruence partitions the set of words in Σ∗ into equivalence classes. The algorithm maintains an approximation of the Nerode congruence, which is gradually refined as the algorithm progresses. Initially, the Learner assumes that all words over the alphabet are Nerode-equivalent, i.e., that there are no distinguishing suffixes. By interacting with the Teacher, the Learner progressively discovers which words can be separated by means of distinguishing suffixes. The Learner also discovers new equivalence classes. She refines her approximation of the Nerode congruence accordingly, until she is able to build a hypothesis automaton using the equivalence classes of the congruence to represent distinct states.

Observation table

In order to construct a hypothesis automaton, the Learner uses an observation table. An observation table organizes the information obtained by the Learner from making queries, and represents the current approximation of the Nerode congruence. Provided that the table satisfies certain properties, it can be used to directly construct an automaton. The table is maintained by the Learner and continually extended as the algorithm proceeds. Let us see how this works, using the example language Labb* recognized by the DFA in Figure 2.2.

Figure 4.1 shows an observation table. Rows are labeled with prefixes; columns are labeled with suffixes. The prefixes above the double bar are called short prefixes; the prefixes below the double bar, extended prefixes. The symbol ε denotes the empty word. Each cell contains either + or −, denoting whether the concatenation of the corresponding prefix and suffix is in the target language or not. If a membership query has not yet been made for a particular concatenation, that cell is empty. For example, the second row, labeled



with the prefix a, denotes that the word ab is in the target language, whereas a and aab are not.

Suffixes in the table distinguish different prefixes. For example, the suffix ε distinguishes the short prefix ab from the three other short prefixes, since the cell for the prefix ab and suffix ε is labeled +, but all other cells in the column are labeled −. The suffix ab distinguishes the short prefix ε from the three other short prefixes, since the cell for the prefix ε and suffix ab is labeled +, but all other cells in the column are labeled −. The observation table thus induces an equivalence relation on prefixes with respect to the set of suffixes; this is an overapproximation of the Nerode congruence.

The table can be used to build a minimal DFA, in which the short prefixes will become states, and the extended prefixes will be used to define transitions. For example, the short prefix ab will become a state q, and the extended prefix aba defines the state to which the DFA will move from q when processing the symbol a. From this table, we can construct the DFA in Figure 2.2.

Prefix   | Suffixes
         | ε   b   ab
---------+-------------
ε        | −   −   +
a        | −   +   −
ab       | +   +   −
b        | −   −   −
---------+-------------
aa       | −   −   −
aba      | −   −   −
abb      | +   +   −
ba       | −   −   −
bb       | −   −   −

Figure 4.1. Example of an observation table.

We are now ready to formally define an observation table.

Definition 10 (Observation table). An observation table is a tuple O = (S, E, T) where

• S is a prefix-closed set of short prefixes,
• E is a set of suffixes, and
• T : (S ∪ (S·Σ)) × E → {+,−} is a labeling function. �

In an observation table, T(s,e) denotes the label for the word se, i.e., whether se is a word in the target language or not.

Recall that the observation table induces an equivalence relation on prefixes relative to the set of suffixes in the table. Formally, two prefixes s and s′ are equivalent, denoted s ≡OT s′, if T(s,e) = T(s′,e) for all suffixes e in the table.



An observation table is closed if all prefixes have an equivalent short prefix. This property is formalized in the following definition:

Definition 11 (Closed observation table). An observation table O = (S, E, T) is closed if for any prefix s′ ∈ (S ∪ (S·Σ)), there is a short prefix s ∈ S such that s ≡OT s′. �

An observation table that is not closed has a prefix s′ for which there is no equivalent short prefix. The table can be made closed by promoting s′ to a short prefix.

Definition 12 (Constructing a finite automaton from an observation table). A closed observation table O = (S, E, T) can be used to construct a finite automaton A = 〈Σ, Q, qε, F, δ〉 in the following manner:

• Σ is the input alphabet.
• The set Q of states is obtained from the set S of short prefixes: each short prefix s ∈ S represents a state qs ∈ Q.
• The initial state is qε, represented by the short prefix ε.
• The set F consists of the accepting states in Q. A state qs ∈ Q is accepting whenever T(s,ε) = +, and rejecting otherwise.
• The set δ of transitions is extracted from the set of prefixes, so that for some s ∈ S and a ∈ Σ, we get 〈qs, a, qs′〉 ∈ δ where s′ is the unique short prefix such that sa ≡OT s′. �
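The following sketch (our own, not from the papers) carries out Definition 12 for the table in Figure 4.1: prefixes are plain Java strings, a row is a string of +/- labels in the suffix order ε, b, ab, and all names are assumptions made for the example.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /** Sketch of Definition 12 for the alphabet {a, b}. */
    public class TableToDfa {

        /** rows maps every prefix in S and S·Σ to its row of labels, in the suffix order ε, b, ab. */
        static void buildAndPrint(List<String> shortPrefixes, Map<String, String> rows) {
            String[] alphabet = {"a", "b"};
            for (int q = 0; q < shortPrefixes.size(); q++) {
                String s = shortPrefixes.get(q);
                boolean accepting = rows.get(s).charAt(0) == '+';     // the ε-column decides acceptance
                System.out.println("state q" + q + " (prefix " + (s.isEmpty() ? "ε" : s) + ")"
                        + (accepting ? ", accepting" : ""));
                for (String x : alphabet) {
                    String targetRow = rows.get(s + x);               // row of the extended prefix s·x
                    int target = -1;                                  // closedness: some short prefix has this row
                    for (int t = 0; t < shortPrefixes.size(); t++)
                        if (rows.get(shortPrefixes.get(t)).equals(targetRow)) { target = t; break; }
                    System.out.println("  --" + x + "--> q" + target);
                }
            }
        }

        public static void main(String[] args) {
            Map<String, String> rows = new HashMap<>();               // the closed table of Figure 4.1
            rows.put("", "--+");  rows.put("a", "-+-");  rows.put("ab", "++-"); rows.put("b", "---");
            rows.put("aa", "---"); rows.put("aba", "---"); rows.put("abb", "++-");
            rows.put("ba", "---"); rows.put("bb", "---");
            buildAndPrint(List.of("", "a", "ab", "b"), rows);         // prints the DFA of Figure 2.2
        }
    }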

The phases of L∗

Descriptions of the L∗ algorithm usually separate it into three phases: hypothesis construction, hypothesis validation, and counterexample processing. These phases are iterated in order until a correct final model is obtained.

Hypothesis construction. The Learner constructs words corresponding to rows in the observation table that have not been filled yet, and makes membership queries for them. The Teacher's answers are entered into an observation table. The Learner continues making membership queries, occasionally promoting extended prefixes to short prefixes until the table is closed, at which time she constructs a hypothesis automaton. The hypothesis automaton represents the Learner's current best approximation of the target language.

Hypothesis validation. The Learner submits the hypothesis automaton to the Teacher for an equivalence query. In theory, the Teacher can easily answer this question by YES or NO, since it already knows whether the model is correct or not. In practice, teachers with complete knowledge of the target language are often not available, so other methods are used.



Counterexample processing. If the model is not correct, the Teacher produces a counterexample. A counterexample is a word c = a1 . . . am on which the Learner and Teacher do not agree, i.e., a word that is accepted by the hypothesis automaton, but which is not in the language (or vice versa). Processing the counterexample will find a suffix that, when added to the observation table, will distinguish a new state in the hypothesis automaton. Let us see how this is done.

For each i such that 0 ≤ i ≤ m, the prefix a1 . . . ai of the counterexample leads to a state in the hypothesis. We use ui to denote the short prefix that represents this state. The short prefix u0 is ε. For a short prefix ui, we obtain ui+1 as the short prefix with ui+1 ≡OT uiai+1. Since the table is closed, such a prefix ui+1 must exist. We define vj as the suffix aj+1 . . . am of the counterexample.

We observe that there must be an index j such that exactly one of the words

uj−1ajvj and ujvj is in the language. If this were not the case, i.e., if for all j, either uj−1vj−1 and ujvj would both be in the language or both not in the language, then either none or both of u0v0 and umvm would be in the language, meaning that c would not be a counterexample. At the index j where the counterexample and hypothesis diverge, the suffix vj is a distinguishing suffix for the prefixes uj−1aj and uj.

To resolve the counterexample, we add vj to the table as a new suffix. This will make uj−1aj inequivalent to uj, promoting uj−1aj to a short prefix.
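A minimal sketch of this search follows (our own illustration; Rivest and Schapire locate the index with binary search, while a linear scan is used here for clarity). The Hypothesis and MembershipOracle interfaces are assumed helper types, not part of the papers.

    /** Sketch of counterexample processing: find the suffix v_j that will become a new column. */
    public class CounterexampleProcessing {

        interface Hypothesis {
            /** The short prefix representing the state reached by the given word. */
            String shortPrefixAfter(String word);
        }

        interface MembershipOracle {
            boolean inLanguage(String word);
        }

        /** Linear scan for an index j where u_{j-1}a_j v_j and u_j v_j disagree on membership. */
        static String distinguishingSuffix(String cex, Hypothesis hyp, MembershipOracle mq) {
            for (int j = 1; j <= cex.length(); j++) {
                String uPrev = hyp.shortPrefixAfter(cex.substring(0, j - 1));   // u_{j-1}
                String uCur  = hyp.shortPrefixAfter(cex.substring(0, j));       // u_j
                String aj    = cex.substring(j - 1, j);                         // a_j
                String vj    = cex.substring(j);                                // a_{j+1} ... a_m
                if (mq.inLanguage(uPrev + aj + vj) != mq.inLanguage(uCur + vj))
                    return vj;                                 // v_j distinguishes u_{j-1}a_j from u_j
            }
            throw new IllegalArgumentException("not a counterexample");
        }
    }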

Example: Learning a small DFA

We illustrate L∗ by applying it to the regular language Labb*, recognized by the DFA in Figure 2.2. The Learner initializes the observation table by adding a prefix row, labeled ε. To complete the table, she also makes membership queries for ε concatenated with each alphabet symbol, adding rows for a and b. The Learner also adds one suffix column, labeled ε. Then, she makes membership queries to fill the table. All membership queries are answered with −. The resulting table is closed, and from it the Learner constructs the first hypothesis automaton H1. The hypothesis and corresponding table are shown in Figure 4.2.

The model is not correct, so the Teacher provides a counterexample, ab, which is in the target language but rejected by H1. Figure 4.3 shows step by step how the Learner processes the counterexample:

• The hypothesis starts running the counterexample in the state represented by the short prefix u0, i.e., ε. At index 1 of the counterexample, we consider the prefix a, which also leads to the state in the hypothesis represented by the short prefix u1 = ε. At this point, we have the suffix v1 = b.



[Figure 4.2 depicts H1: a single state q0 with a self-loop on a, b.]

Prefix | Suffixes
       | ε
-------+---------
ε      | −
a      | −
b      | −

Figure 4.2. Learner's hypothesis H1 and corresponding observation table.

Here, we see a discrepancy between the counterexample and the hypothesis: the word u0a1v1 = ab is in the target language, but the word u1v1 = b is not. We have found an undiscovered state.

• The Learner adds b to the set of suffixes, and makes membership queries to fill the table. This causes it to be no longer closed, since the extended prefix a is not equivalent to the only short prefix, ε (see Figure 4.3(a)).

• The table is closed by making a a short prefix, and making membership queries for aa and ab. This again causes the table to be not closed, since the extended prefix ab is not equivalent to any of the short prefixes (see Figure 4.3(b)).

• The table is closed by making ab a short prefix, and making membership queries for aba and abb.

Prefix | Suffixes
       | ε   b
-------+--------
ε      | −   −
-------+--------
a      | −   +
b      | −   −

(a) Not closed observation table (no short prefix is equivalent to a).

Prefix | Suffixes
       | ε   b
-------+--------
ε      | −   −
a      | −   +
-------+--------
b      | −   −
aa     | −   −
ab     | +   +

(b) Not closed observation table (no short prefix is equivalent to ab).

Figure 4.3. Observation tables.

The Learner now constructs her second hypothesis automaton H2. The hypothesis and corresponding table are shown in Figure 4.4. This hypothesis is also not correct, since it does not distinguish the states represented by ε and b. A counterexample will prompt the addition of the suffix ab, resulting in the table shown in Figure 4.1.



[Figure 4.4 depicts H2 with three states: q0 (for ε), q1 (for a), and q2 (for ab, accepting). q0 loops on b and moves to q1 on a; q1 moves back to q0 on a and to q2 on b; q2 moves to q0 on a and loops on b.]

Prefix | Suffixes
       | ε   b
-------+--------
ε      | −   −
a      | −   +
ab     | +   +
-------+--------
b      | −   −
aa     | −   −
aba    | −   −
abb    | +   +

Figure 4.4. Learner's hypothesis H2 and corresponding observation table.

This observation table correctly represents the target language, so the Teacher will answer YES to the last equivalence query.

Complexity of L∗

The L∗ algorithm is guaranteed to converge and produce a minimal DFA that recognizes the target language. If n is the number of states in the minimal DFA, m the length of the longest counterexample provided by the Teacher, and k the size of the input alphabet, the upper bound on the number of equivalence queries is n, and on the number of membership queries, kn² + n log m, by the following reasoning.

• Each counterexample will trigger the addition of one new suffix (column) to the observation table, which will distinguish two prefixes and generate a new state. Thus, the upper bound on the number of equivalence queries is n.

• Membership queries are used for two purposes: to close the table and to process the counterexample. A table is closed by making membership queries for the k one-symbol extensions of a newly promoted short prefix, each of whose rows has up to n suffix columns to fill; since at most n prefixes are promoted, the upper bound is kn². To process the counterexample according to the approach by Rivest and Schapire [64], we find a distinguishing suffix using binary search. This requires log m membership queries. Since there may be at most n counterexamples, the upper bound is n log m. In total, the upper bound on the number of membership queries is kn² + n log m.
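As a rough numeric illustration (our own numbers, not from the papers): for a target DFA with n = 4 states over a two-symbol alphabet (k = 2), and counterexamples of length at most m = 8, the bound evaluates to 2 · 4² + 4 · log₂ 8 = 32 + 12 = 44 membership queries and at most 4 equivalence queries.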



4.2 The SL∗ algorithm: inferring register automata models

The SL∗ algorithm (Paper III) is an adaptation of the L∗ algorithm to the symbolic setting, by which we mean that the algorithm can be used to infer register automata models of data languages. Like L∗, SL∗ follows the MAT model and is based on interactions between a Learner and a Teacher. An important aspect in which SL∗ differs from L∗ is the types of queries that the Teacher can answer. Instead of membership queries, the Teacher in the SL∗ algorithm answers tree queries. When the Teacher answers a tree query, she actually makes a number of membership queries and generates a symbolic decision tree from these queries and their answers. The symbolic decision tree is then returned to the Learner.

Conceptually, SL∗ is based on the symbolic Nerode congruence (Section 3.2) and on partition refinement. The Learner's initial assumption is that all data words are equivalent. By interacting with the Teacher, the Learner progressively discovers which prefixes can be separated by means of symbolic suffixes, in the sense that the tree oracle returns inequivalent symbolic decision trees for them. Initially, the Learner makes tree queries that return symbolic decision trees for shorter suffixes; she then refines them by asking for trees that represent longer suffixes. The approximation of the symbolic Nerode congruence is refined accordingly, until the Learner is able to build a hypothesis automaton where locations are represented by equivalence classes. The symbolic decision trees are then used to construct transitions, guards, and assignments in the automaton.

Observation table

As in the L∗ algorithm, the Learner maintains an observation table, which organizes the information the Learner has obtained from making queries. The table represents the current approximation of the symbolic Nerode congruence. It can be used to directly construct a register automaton, provided that it satisfies certain properties. The observation table is continually improved as the algorithm proceeds, though in a slightly different way than in L∗: symbolic decision trees are used in place of suffixes, but instead of adding new trees, existing trees are refined by increasing the set of symbolic suffixes used to construct them.

We illustrate how this works, using the example language recognized by the register automaton in Figure 3.2.

Figure 4.5 shows an observation table used by the SL∗ algorithm. As in L∗, the table has rows. However, it has only one column, labeled with a set V of symbolic suffixes. Each cell contains the tree O(u,V) constructed by the canonical tree oracle for the prefix u with which the row is labeled, and the set V of symbolic suffixes with which the column is labeled.



[Figure 4.5 shows the table for V = {ε, a, aa}. The short prefixes ε, a(2), a(2)a(1), and a(2)a(1)a(0) appear above the double bar; the extended prefixes a(2)a(3), a(2)a(1)a(3), and a(2)a(1)a(0)a(3) below it. The cell for ε holds a tree that accepts unconditionally; the cell for a(2) holds the tree of Figure 3.5 (root register x1, initial guards p < x1 with x2 := p, and x1 ≤ p); the cell for a(2)a(1) holds a tree with root register x2 and initial guards p < x2 and x2 ≤ p, where only the x2 ≤ p branch leads to acceptance; the cell for a(2)a(1)a(0) holds a tree that rejects everything. The cell for a(2)a(3) is the same as that for a(2) with x1 renamed to x2, the cell for a(2)a(1)a(3) is the same as that for a(2) with x1 renamed to x3, and the cell for a(2)a(1)a(0)a(3) is the same as that for a(2)a(1)a(0).]

Figure 4.5. Example of an observation table.

For example, the cell in the second row, labeled a(2), contains the tree O(a(2), {ε, a, aa}) 1.

Just as in L∗, the prefixes in the table are divided into short and extended prefixes. Each extended prefix is equivalent to one short prefix. When a symbolic decision tree in the lower part of the table is refined, it may become inequivalent to all symbolic decision trees in the upper part of the table. Whenever this happens, the associated prefix is promoted to a short prefix.

For each initial transition of a (u,V)-tree in the upper part of the table, there is an extended prefix in the lower part of the table. For example, the (u,V)-tree for u = a(2)a(1) has a register x2 that stores the second data value from the prefix a(2)a(1), i.e., 1. The tree has two initial transitions, guarded by x2 ≤ p and p < x2, respectively. Each of these transitions will give rise to a new prefix, i.e., we will get two extended prefixes of the form u α(d^g_u), with d^g_u being the representative data value (see Definition 7) for each guard g after u:

1 This is actually the same tree as in Figure 3.5, rotated 90 degrees.



• From the transition guarded by x2 ≤ p, we get d^g_u = d^{x2 ≤ p}_{a(2)a(1)}. We can define d^g_u as 3, so the resulting extended prefix is a(2)a(1)a(3).

• From the transition guarded by p < x2, we get d^g_u = d^{p < x2}_{a(2)a(1)}. We can define d^g_u as 0, so the resulting extended prefix is a(2)a(1)a(0).

Note that in the observation table in Figure 4.5, a(2)a(1)a(0) has actually been promoted to a short prefix, whereas a(2)a(1)a(3) remains in the lower part of the table. This is because the (u,V)-tree for u = a(2)a(1)a(0) is inequivalent to all other (u,V)-trees in the upper part of the table.
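As a trivially small illustration (our own sketch; the particular constants are arbitrary choices that merely satisfy the guards), the two extended prefixes above can be computed from the stored register value as follows.

    /** Sketch: representative data values for the initial guards after a(2)a(1), where x2 stores 1. */
    public class ExtendedPrefixes {
        // Any value satisfying the guard works as a representative; these picks mirror the example.
        static int representativeForGeq(int stored)  { return stored + 2; }              // satisfies x2 <= p
        static int representativeForLess(int stored) { return Math.max(0, stored - 1); } // satisfies p < x2

        public static void main(String[] args) {
            int x2 = 1;   // second data value of the prefix a(2)a(1)
            System.out.println("a(2)a(1)a(" + representativeForGeq(x2) + ")");   // a(2)a(1)a(3)
            System.out.println("a(2)a(1)a(" + representativeForLess(x2) + ")");  // a(2)a(1)a(0)
        }
    }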

We define an observation table for SL∗ as follows:

Definition 13 (Observation table). An observation table O for SL∗ is a tuple O = (U, U+, V, Z) where

• U is a prefix-closed set of short prefixes,
• U+ is a set of extended prefixes, each of the form uα(d),
• V is a set of symbolic suffixes, and
• Z maps each prefix u ∈ (U ∪ U+) to a (u,V)-tree. �

The set U+ of extended prefixes contains exactly those words uα(d) where d is the representative data value d^g_u for a guard g on an initial branch in Z(u).

In an observation table, the Learner fills the cells Z(u) with (u,V)-trees generated by an oracle. The observation table induces an equivalence relation on prefixes: two prefixes u and u′ are equivalent if Z(u) ≡γ Z(u′) for some renaming γ of registers in the root locations of the trees.

An observation table is closed if for any extended prefix u ∈ U+, there is a short prefix u′ ∈ U such that Z(u) ≡γ Z(u′) for some renaming γ of registers. This is exactly the same concept as in L∗, except individual suffixes are replaced here by symbolic decision trees.

An observation table is register-consistent if whenever the ith data value in a prefix u is needed in a symbolic decision tree Z(uα(d)), then there is also a register in the symbolic decision tree Z(u) where this data value can be stored. This is a new property specific to the register automata setting. The SL∗ algorithm needs this property to guarantee that data values that are needed later in a run of the register automaton are stored when they first occur. Intuitively, if an observation table is not register-consistent, there is a relation between a data value in a suffix and a data value di in a prefix. Moreover, this relation is modeled as a guard in a symbolic decision tree, but there is no register xi that can store di. In other words, we have failed to take into account a register assignment.

A table is made register-consistent by adding the symbolic suffix αv to V, where v refers to the symbolic suffix that causes xi to appear in a guard. After adding αv, Z(u) will contain an initial transition for α(d), and the value di that occurs in u will be stored in the register xi.



Definition 14 (Constructing a register automaton from an observation table). A closed and register-consistent observation table O = (U, U+, V, Z) can be used to construct a register automaton 〈L, l0, F, X, Γ〉, according to the following specification.

• The set L of locations is obtained from the set U of short prefixes: each short prefix u ∈ U represents a location lu ∈ L.
• The initial location l0 is represented by the short prefix ε.
• The set F consists of the accepting locations in L. A location lu ∈ L is accepting whenever the root node of the tree Z(u) is accepting, and rejecting otherwise.
• For each location lu ∈ L, the registers X(lu) are obtained as the registers in the root node of Z(u).
• The transitions are extracted from the set of short prefixes and their symbolic decision trees. Each initial transition 〈l, α(p), g, π, l′〉 in Z(u) generates a transition 〈u, α(p), g, π′, u′〉 where
  – u′ is the unique short prefix in U such that uα(d^g_u) ≡γ u′,
  – α(p) is a parameterized input symbol with the parameter p,
  – g is a guard over p and X(lu) (the same guard as in Z(u)), and
  – π′ is the assignment, generated from π and γ, i.e., it is a mapping from X(Z(u′)) to X(Z(u)) and p. It assigns some data value to the register xi ∈ X(Z(u′)): either γ−1(xi) if the already stored value is kept, or γ−1 ◦ π−1(p) if the current input parameter is stored. �

When building a register automaton from the table, the short prefixes will be used as locations, since they are mutually inequivalent. The extended prefixes will be used to represent transitions: if u ∈ U is a short prefix (which will become a location) then, when processing a data symbol α(d) that satisfies a guard g in one of the initial transitions of Z(u), the automaton will move to the location represented by uα(d^g_u). The guards and assignments on each initial transition of a (u,V)-tree will thus become guards and assignments in the register automaton.

For example, the short prefix a(2) will become a location l in a register automaton constructed from the observation table in Figure 4.5. The tree Z(a(2)) has two initial transitions, guarded by x1 ≤ p and p < x1, respectively. Each of these is represented as a prefix in the table: a(2)a(1), which is also a short prefix, and a(2)a(3), which is an extended prefix. Thus, we create two transitions from l, both of form 〈l, α(p), g, π, l′〉:

• The guard g is p < x1. The corresponding prefix is a(2)a(1), which is a short prefix. In the tree, we have x2 := p, so the assignment π is π(x2) = p. The location l′ is the location represented by a(2)a(1).

• The guard g is x1 ≤ p. The corresponding prefix is a(2)a(3), for which Z(a(2)) ≡γ Z(a(2)a(3)) whenever γ(x1) = x2. In the tree, we have no assignment (which means we reassign registers). Thus, we obtain the



assignment π(x1) = γ−1(x2). The location l′ is the location represented by a(2).

The phases of SL∗

Similar to L∗, we also separate the description of SL∗ into three phases: hypothesis construction, hypothesis validation, and counterexample processing. These phases are iterated in order until a correct final model is obtained.

Hypothesis construction. The Learner makes tree queries for combinations of concrete prefixes and sets of symbolic suffixes. The Teacher takes advantage of the canonical tree oracle to construct (u,V)-trees, which are then returned to the Learner and used to fill the observation table. The Learner continues making tree queries until the table is closed and register-consistent, which may involve adding more prefixes and more symbolic suffixes. Then, she constructs a hypothesis automaton representing the Learner's current best approximation of the target language.

Hypothesis validation. The Learner submits the hypothesis automaton to the Teacher for an equivalence query. In theory, the Teacher can easily answer this question by YES or NO, since it already knows whether the model is correct or not. In practice, teachers with complete knowledge of the target language are often not available, so other methods are used. We offer a more detailed discussion of how to implement or mimic equivalence queries in practice, in Paper IV.

Counterexample processing. If the model is not correct, the Teacher produces a counterexample. A counterexample is a string on which the Learner and Teacher do not agree, i.e., a string that is accepted by the hypothesis automaton, but the Teacher knows should not be accepted, or vice versa. The counterexample is processed in a manner similar to that described by Rivest and Schapire [64]. The result of processing a counterexample will be the addition of a new symbolic suffix to V. We describe counterexample processing in more detail in the next section.

Counterexample processing in SL∗

The goal of counterexample processing in SL∗ is to find a symbolic suffix which we will add to the set V of symbolic suffixes in the observation table. Let us see how this can be done.

We assume that the Learner has constructed a hypothesis automaton H, submitted it for an equivalence query, and received a counterexample c in return. The counterexample is a data word on which H and the Teacher disagree, i.e., a word that is in the target language but rejected by H, or vice versa.



Technically, the counterexample is a data word c = α1(d1) . . . αm(dm). Each prefix α1(d1) . . . αj(dj) leads to a location in the hypothesis that is represented by the short prefix uj. We define vj as the symbolic suffix αj+1 . . . αm.

In the counterexample, there is some index i where the symbolic decision tree OL(ui, {αi+1 . . . αm}) disagrees with the symbolic decision tree OL(ui−1, {αi . . . αm}) on how to classify the counterexample (i.e., one tree accepts and the other does not). If there is no such i, then there is no discrepancy between the counterexample and the hypothesis, in analogy with the principles for counterexample processing in L∗.

There are two possible reasons why the counterexample and the hypothesis would disagree:

(i) One of the guards in the hypothesis is not refined enough. More precisely, the guard g satisfied by αi(di) in Z(ui−1) should be distinct from the guard satisfied by αi(d^g_{ui−1}) (the representative data value for g).

(ii) The renaming γ of registers used in the hypothesis is wrong. More precisely, the subtree of OL(ui−1, {αi . . . αm}) starting in the location reached by αi(di) is not equivalent to OL(ui, {αi+1 . . . αm}) under γ.

We describe each of these cases below.

Case (i): One of the guards in the hypothesis is not refined enough. Let gi be the guard of the initial transition of Z(ui−1) that αi(di) satisfies, and g′i the guard of the initial transition of OL(ui−1, {vi−1}) that αi(di) satisfies. In this case, g′i is more refined than gi, which means that the symbolic decision tree OL(ui−1, {vi−1}) has more initial branches than the symbolic decision tree Z(ui−1) in the observation table.

We add vi−1 to V, which will result in at least one new or refined transition from ui−1 in the hypothesis (Definition 8).

Case (ii): The renaming γ of registers used in the hypothesis is wrong. Let T be the subtree of OL(ui−1, {vi−1}) that starts in the location reached by αi(di) via a transition guarded by gi. We have that T ≢γ OL(ui, {vi}), i.e., the trees are not isomorphic under the renaming γ.

We add vi to V, which will result in one of two possible outcomes: either the remapping used in the hypothesis will be refined, or the prefix ui−1 αi(d^{gi}_{ui−1}) will be promoted to a short prefix.

By the above arguments, we see that processing the counterexample will result in a new suffix being added to the set of symbolic suffixes in the observation table. This will refine the symbolic decision trees in the observation table, which in turn will lead to a new location or a new transition being discovered, or to the renaming of registers used in the hypothesis being refined.



Example: Inferring a small register automaton

We describe an application of SL∗ to learning the register automaton in Figure 3.2. The Learner initializes the observation table by adding a prefix row, labeled ε. She makes a tree query for ε and the set {ε} of symbolic suffixes. The resulting tree is added to the observation table. To complete the table, she also makes tree queries for ε concatenated with the only alphabet symbol, adding a row for a(2). The resulting table is closed and register-consistent, and from it the Learner constructs her first hypothesis automaton H1. The hypothesis and corresponding table are shown in Figure 4.6.

[Figure 4.6 depicts H1: a single accepting location l0 with a self-loop on a(p).]

Prefix | Symbolic suffixes
       | V = {ε}
-------+-------------------------------
ε      | (a single accepting root location)
-------+-------------------------------
a(2)   | same as ε

Figure 4.6. Learner's hypothesis H1 and corresponding observation table.

The model is not correct, so the Teacher provides the counterexample a(2)a(1)a(0), which is not in the target language but accepted by H1. After processing the counterexample, the suffix aa is added to the set of symbolic suffixes. This causes the table to be not closed, because the tree Z(a(2)) is no longer isomorphic to the tree Z(ε) (see Figure ). The prefix a(2) is promoted to a short prefix.

The Learner makes tree queries for the extended prefixes corresponding to each of the initial transitions in Z(a(2)), i.e., to obtain OL(a(2)a(1), V) and OL(a(2)a(3), V). The result is shown in Figure 4.7. The table is not closed, since the tree Z(a(2)a(1)) is not isomorphic to any of the trees in the upper part of the table. The table is closed by promoting a(2)a(1) to a short prefix.

The Learner now makes tree queries for the extended prefixes corresponding to each of the initial transitions in Z(a(2)a(1)), i.e., to obtain OL(a(2)a(1)a(0), V) and OL(a(2)a(1)a(3), V). The table is again not closed, since the tree Z(a(2)a(1)a(0)) is not isomorphic to any of the trees in the upper part of the table. The table is closed by promoting a(2)a(1)a(0) to a short prefix. The Learner makes a tree query for the extended prefix a(2)a(1)a(0)a(3) corresponding to the only initial transition in Z(a(2)a(1)a(0)) in order to complete the table.



The suffixes in V now distinguish the four locations l0, l1, l2 and l3 in Figure 3.2. In the observation table in Figure 4.5, location l0 is represented by ε, location l1 is represented by a(2), location l2 is represented by a(2)a(1) and the sink location is represented by a(2)a(1)a(0).

[Figure 4.7 shows the table for V = {ε, a, aa} at this stage: the short prefixes ε and a(2) above the double bar, and the extended prefixes a(2)a(1) and a(2)a(3) (the latter the same as a(2), with x1 renamed to x2) below it. The cells hold the same trees as the corresponding rows of Figure 4.5.]

Figure 4.7. Not closed observation table.

The Learner now has the observation table in Figure 4.5. This observation table correctly represents the target language, so the Teacher will answer YES to the last equivalence query.

4.3 Automata learning in realistic scenarios

In the latter part of this chapter, we described SL∗, a learning algorithm for register automata. We will now describe an extension to SL∗ that we have developed, which is oriented towards more practical applications of automata learning techniques. Recall that SL∗ infers register automata models of regular data languages, i.e., automata that accept or reject data words. However, many actual programs or components are not quite as simplistic: they do not simply state whether executing a command or invoking a method is OK or NOT OK; they produce output of some kind.

Figure 4.8 shows code for a small program that implements a map. The map stores key-value pairs 〈k,v〉 and exposes two methods: put and get. When we invoke put(v), the map checks that its capacity is not exhausted, creates a new



unique key k, and returns k as output. When we invoke get(k), the map checks that it contains a pair 〈k,v〉 and returns v as output.

class KeyGenMap {
    private Map map = new HashMap();

    K put(V val) {
        assert map.size() < MAX_CAPACITY;
        K key = generateUniqueKey();
        map.put(key, val);
        return key;
    }

    V get(K key) {
        assert map.containsKey(key);
        return map.get(key);
    }
}

Figure 4.8. Code for a map that generates keys.

How can we infer a model of a program such as the map in Figure 4.8, in a manner similar to SL∗? One way is to invent a new register automaton model (or use an existing one) that allows for non-binary output, and a learning algorithm that can infer such models (e.g., [46]). Another way, which we will explore here, is to use the register automata formalism from Paper III but treat sequences of alternating input and output as words in a language. In this manner, we can treat the problem as one of language inference, and apply the SL∗ algorithm without fundamental modifications.
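One way to realize this idea, sketched below in Java (our own illustration; the DataSymbol record and the Component interface are assumptions, not the interface of the tool described in the papers), is to answer membership queries by replaying the input symbols of an alternating input/output word on the component and checking that the observed outputs match the word.

    import java.util.List;

    /** Sketch: treating alternating input/output sequences as words of a data language. */
    public class AlternatingWordMembership {

        /** A data symbol: an action name with one data value, e.g. put(a) or key(4). */
        record DataSymbol(String action, Object value) { }

        /** The system under learning; assumed interface, reset before each word is replayed. */
        interface Component {
            void reset();
            DataSymbol step(DataSymbol input);    // apply one input symbol, observe one output symbol
        }

        /** The word d1 o1 d2 o2 ... is in the language iff the component answers o_i to each input d_i. */
        static boolean accepts(Component sut, List<DataSymbol> word) {
            if (word.size() % 2 != 0) return false;
            sut.reset();
            for (int i = 0; i < word.size(); i += 2) {
                DataSymbol observed = sut.step(word.get(i));
                if (!observed.equals(word.get(i + 1))) return false;   // observed output disagrees with the word
            }
            return true;
        }
    }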

We extend the SL∗ algorithm, not by altering the basic algorithm, but by implementing it in a tool together with practical extensions.

Concretely, we consider each input or output symbol to be a data symbol of the form α(d). Each data value d has a particular type that determines which other values it can be compared to. In the map example, types are keys and values. Only values of the same type can be compared to each other, meaning that each type is associated with a particular theory. This allows for more efficient generation of symbolic decision trees, since relations between data values of different types need not be considered.

An issue with register automata in general is that they cannot model fresh data values, i.e., data values that are inequivalent to all other data values, since this would require comparing all data values in a domain and possibly using an unbounded number of registers to do so. However, register automata can be used to model a particular restricted form of freshness, where a data value is inequivalent to all other data values that we have already seen in a word of bounded length. We take this approach to modeling freshness, and introduce a



unary relation fr to represent that a data value is inequivalent to all previously seen data values.
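A minimal sketch (our own; the Type enum and TypedValue record are assumptions) of typed data values and the bounded freshness that fr expresses: a value is fresh if it differs from every earlier value of the same type in the current word.

    import java.util.List;

    /** Sketch: typed data values and the unary relation fr (freshness within one word). */
    public class Freshness {

        enum Type { KEY, VALUE }

        record TypedValue(Type type, Object value) { }

        /** Only values of the same type are compared; fr holds if no earlier value of that type is equal. */
        static boolean fr(List<TypedValue> seenSoFar, TypedValue v) {
            return seenSoFar.stream()
                    .filter(w -> w.type() == v.type())
                    .noneMatch(w -> w.value().equals(v.value()));
        }

        public static void main(String[] args) {
            List<TypedValue> seen = List.of(new TypedValue(Type.VALUE, "a"), new TypedValue(Type.KEY, 4));
            System.out.println(fr(seen, new TypedValue(Type.KEY, 2)));   // true: key 2 has not occurred before
            System.out.println(fr(seen, new TypedValue(Type.KEY, 4)));   // false: key 4 already occurred
        }
    }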

Example: A register automaton with input and output, typed parameters, and fresh data values

Figure 4.9 shows a register automaton model of the map that generates keys, i.e., the external behavior of the component in Figure 4.8. It is limited to capacity 1, for space reasons. The register automaton is a condensed version of the actual register automaton we learn: intermediate states, where no input is accepted and only output is generated, are omitted and replaced by a combined input/output transition label. For example, the loop in location l0 is actually inferred as a transition get(pK) | −, − to an intermediate state, and then another transition error | −, − back to l0. Transition labels are of the form α(pT) | g, π, where α(p) is a parameterized symbol (sometimes abbreviated α if there is no parameter), T is the type of the parameter p, g is a guard over p and registers, and π is an assignment. Registers also have types, so a register xT can only store a data value of type T.

The register automaton processes input symbols starting in location l0. For example, the word put(a)/key(4) get(4)/val(a) is accepted by the automaton. This corresponds to invoking put(a) and receiving 4 as output, then invoking get(4) and returning a as output. The word put(a)/key(4) get(2)/error is also accepted by the automaton. This corresponds to invoking put(a), receiving 4 as output, then invoking get(2) and getting an error since no value with the key 2 is stored in the map. □

[Figure 4.9 is a diagram of the register automaton, with locations l0 and l1 and transitions carrying the labels:
    put(pV) | −, xV := pV  /  key(pK) | fr(pK), xK := pK
    get(pK) | −, −  /  error | −, −
    get(pK) | xK ≠ pK  /  error | −, −
    put(pV) | −, −  /  error | −, −
    get(pK) | xK = pK  /  val(pV) | xV = pV]

Figure 4.9. Model of a map that generates keys.


5. Summaries of papers

We summarize briefly the contents of each of Papers I-IV.

I A succinct canonical register automaton model

Paper I defines a register automaton model that recognizes data languages where data values can be compared for equality. It is an extended journal version of the paper [28] where the automaton model was originally defined, using slightly different but equivalent notation.

The motivation behind the paper was to come up with a formalism for register automata that was canonical, while being as succinct as possible.

At the time, different forms of (canonical) register automata had been presented, but they all suffered from restrictions such as requiring a total order on data values, or requiring stored values to be unique. These restrictions made the resulting automata unnecessarily large (in terms of state space, and in terms of the number of registers they needed). Another source of unnecessarily large models was the encoding of all equalities between data values in a data word, regardless of whether they are needed in order to recognize the data language.

In the paper, we represent a data language over the theory of equality as a classification of a set of data words. We show that any data language can be represented by restricting its classification to a finite minimal subset of essential data words. The essential data words exhibit exactly the equalities that are needed in order to recognize the data language. For each non-essential data word, there is an essential data word with at most as many equalities, and whose classification extends to the non-essential word.

To obtain the set of essential words for a data language, we organize data words in a tree according to two partial orders: how many equalities a word has, and at what point in the word the first equality (if any) occurs. Nodes in the tree are then merged whenever they are isomorphic, in a procedure similar to minimization of binary decision diagrams (BDDs). The resulting tree is minimal in a certain class, and its branches are the essential words for the data language. (The essential words correspond to the branches of the canonical symbolic decision trees in Paper III.)

The main result in the paper is a Nerode congruence for data languages over the theory of equality. Using this congruence, we can fold a set of essential data words into a register automaton. Two essential data words will lead to the same location in the automaton if their residual languages are equivalent, i.e., if whenever we extend both data words with equivalent suffixes, the words will remain equivalent.

The automata we construct in this manner are canonical and determinate: a data word may correspond to more than one path in the automaton, but the paths are either all accepting or all rejecting. The models can be made deterministic, but this conversion can be done in several ways and will thus not result in canonical models. We prove that a data language is recognizable by a determinate register automaton if and only if the Nerode congruence on the set of essential words has finite index.

We compare our formalism to others where different restrictions have been made on the data values, and show that our models can be exponentially more succinct while remaining canonical.

II Inferring Canonical Register Automata

Paper II presents an extension of the L∗ algorithm, enabling us to infer the register automata defined in Paper I. The L∗ algorithm, a well-known example of active automata learning, had already been successfully used to infer control flow models of, e.g., communication protocols. However, for models that capture both control flow and data flow (e.g., register automata), existing automata learning frameworks were either too simplistic, required cumbersome user-supplied abstractions on the data domain, or used a two-step approach where the automaton and the abstraction were generated separately. The paper was motivated by the lack of an algorithm that could directly infer the interplay between data flow and control flow, and generate accurate models. The natural choice was to extend the L∗ algorithm to the register automata setting.

The L∗ algorithm for finite automata uses the well-known Nerode congruence as a basis. In this paper, we used the Nerode congruence for register automata (defined in Paper I) as a basis for the new extended L∗ algorithm. The extended algorithm iterates three phases: hypothesis construction, hypothesis validation, and counterexample processing.

In the hypothesis construction phase, membership queries are made and the results collected in an observation table. In contrast to L∗, membership queries are made for sets of data words rather than for individual words. Technically, we concatenate a data word (prefix) with an abstract suffix, i.e., a data word where data parameters are not instantiated. (Abstract suffixes correspond to the symbolic suffixes in Paper III.) Then, the data parameters are instantiated with data values equal or inequal to data values in the prefix, according to a set of constraints. The answer to a membership query is a closure, i.e., a mapping from each constraint in the set to ACCEPT or REJECT.
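As a worked illustration (hypothetical names; a sketch only), consider the data language over actions a and b in which a b-symbol must carry the same data value as the preceding a-symbol. For the prefix a(1) and the abstract suffix b(p), the closure maps each constraint on p to a classification:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Sketch only: the closure for prefix a(1) and abstract suffix b(p) in the
    // data language where each b must carry the value of the preceding a.
    final class ClosureExample {
        static Map<String, Boolean> closureForPrefixA1() {
            Map<String, Boolean> closure = new LinkedHashMap<>();
            closure.put("p = 1", true);    // a(1)b(1) is in the language: ACCEPT
            closure.put("p != 1", false);  // a(1)b(2) is not:             REJECT
            return closure;
        }
    }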

The algorithm continues to make membership queries until the observation table is closed and register-consistent. The register-consistency criterion is new and specific to the register automata setting. If the observation table is not register-consistent, this means that there is an equality between a data value in a suffix and a data value in a prefix, but there is no register that stores the data value in the prefix. In other words, we have failed to take into account a register assignment.

When the observation table is closed and register-consistent, a hypothesis automaton is generated and submitted for an equivalence query. If the hypothesis automaton is correct, the algorithm terminates; otherwise, a counterexample is returned and processed. The counterexample is guaranteed to induce a new location, transition, or register.

We have implemented the algorithm in LearnLib and applied it to a fragment of the XMPP protocol for instant messaging. The inferred model needs a fraction of the states necessary for a model without registers.

III Active Learning for Extended Finite State Machines

This paper introduces a general black-box framework, SL∗, for inferring register automata where data values can be compared using different binary relations. It is an extended journal version of the paper [29] where the framework was first introduced. The paper also contains an updated and improved definition of the register automata for binary relations first presented in [30].

Prior to our paper, there had been several other works presenting extensions of the L∗ algorithm to extended finite state machines. These, however, tended to focus only on equality between data values and sometimes required additional information, such as access to source code. The motivation behind this paper was to bypass these restrictions and present a general extension of L∗ that could be used out of the box, and that would require as little manual assistance as possible. We wanted the algorithm to take into account both different data domains and different relations between data values.

In this paper, we describe how to learn register automata that recognize data languages where data values can be compared using different binary relations (not just equality). To that end, we also present a formalism for register automata that is parameterized on a theory, i.e., the combination of a data domain and one or more binary relations. The learning algorithm is a generalization of our prior algorithm (Paper II), while the automaton formalism improves the one presented in [30], and is a generalization of the one presented in Paper I.

A central concept in the SL∗ framework is that of tree queries. Tree queries are used to characterize and separate prefixes, much in the same way as membership queries in classic L∗. Making a tree query for a prefix returns a symbolic decision tree that describes a fragment of the data language after the prefix, i.e., 'what happens' when concatenating the prefix with different suffixes. Branches in a decision tree represent different suffixes and the nodes state whether a particular data word (prefix concatenated with suffix) is accepted or rejected. The SL∗ algorithm iterates the same phases as L∗: hypothesis construction, hypothesis validation, and counterexample processing. The observation table stores prefixes and corresponding symbolic decision trees obtained by making tree queries.
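The tree-query mechanism can be pictured with the following interface sketch (hypothetical names and signatures; not the actual implementation's API):

    import java.util.List;
    import java.util.Map;

    // Sketch only: a tree oracle answers a tree query for a prefix and a set
    // of symbolic suffixes with a symbolic decision tree.
    interface TreeOracle {
        SymbolicDecisionTree treeQuery(List<String> prefix, List<String> symbolicSuffixes);
    }

    // Leaf-level view of a symbolic decision tree: guards over suffix parameters
    // and registers (e.g., "p = x1") mapped to ACCEPT (true) or REJECT (false).
    interface SymbolicDecisionTree {
        Map<String, Boolean> leaves();
    }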

We give necessary conditions for a tree oracle, i.e., the function that generates a symbolic decision tree as an answer to a tree query. For the theory of equalities over an infinite domain (such as natural numbers) and for the theory of inequalities over rational or real numbers, we describe tree oracles and show that both fulfill the stipulated conditions. We give a Myhill-Nerode theorem for these theories. Increments and constants are also discussed and we give an idea of how to construct tree oracles when using these extensions together with a theory.

We evaluate our framework on a series of small examples. The examples (e.g., a sequence number counter, an alternating bit protocol, and a prepaid card) illustrate the different theories we can use. We also infer two larger models: the connection establishment part of TCP and a priority queue. These indicate the utility of our framework in more realistic settings.

IV RALib: A LearnLib extension for inferring EFSMs

Paper IV introduces RALib, an extension to the LearnLib framework for automata learning. The extension consists of a stable reimplementation of the SL∗ algorithm (Paper III) together with a set of additional features.

While the SL∗ algorithm provides a general framework for inferring register automata, it was not yet fully usable in real-world scenarios. By reimplementing the algorithm, adding more practical features, and packaging it as an extension to LearnLib, we wanted to make SL∗ more easily available and useful to the public.

In the paper, we describe several features that were added on top of the SL∗ algorithm. With RALib, users can learn models with input and (non-binary) output. This is possible through a filter that essentially treats sequences of alternating input and output as if they were sequences of input. An input/output sequence can then be 'accepted' or 'rejected' based on whether or not it is a valid sequence when executed on the target component. RALib also handles 'fresh' (i.e., previously unseen) data values, something which in principle is not possible in a register automaton. RALib, however, can work around this problem by considering data values 'fresh' if they do not occur in a particular generated test case.

Another important feature is typed data parameters: in RALib, each data parameter is assigned to a particular theory. This allows users to mimic the actual data types in a target component. It also allows for mixing different theories in the same model, since, e.g., parameters of one type can be compared for equality and parameters of another type for inequality (<). Finally, RALib comes with a Java class analyzer, and heuristics for finding counterexamples.

We evaluate RALib by comparing it to Tomte and LearnLibRA, the two most similar available tools for inferring register automata, on two aspects: performance and practicality. We use the benchmarks presented in [5]. In the paper, we detail different optimizations and how they affect the number of tests executed during learning. In terms of the number of tests used during learning, RALib always outperformed LearnLibRA and often Tomte on these benchmarks. In terms of practicality, we tested the Java class analyzer by inferring some data structures (FIFO/LIFO sets and a priority queue). Here, RALib needed a similar number of tests as the other tools. However, none of the other tools were able to infer the priority queue, since they only support data languages over the theory of equality.


6. Related work

This chapter reviews related work, organized along the main topics of this thesis: automata models and automata learning. We also discuss some practical uses for automata learning approaches. For more specific related work, we refer the reader to Papers I-IV.

6.1 Representing languages over infinite alphabets

Many different formalisms for representing languages over infinite alphabets have been proposed; an overview can be found in [67]. In these formalisms, alphabet symbols can be taken directly from an infinite domain (e.g., integers or natural numbers), or they can come from a finite domain but carry data values from an infinite domain. Languages over infinite alphabets can be used for representing, e.g., communication protocols or behavioral interfaces of data structures.

A popular way to model languages over infinite alphabets is to use different types of automata, e.g., register automata or finite-memory automata [15, 49] (see also Chapter 3), timed automata [9], pebble automata [38], and data automata [19, 20]. Here, we consider automata that are similar to the register automata we defined in Chapter 3.

Finite-memory automata (register automata).
Introduced by Kaminski and Francez, finite-memory automata [49] are an early formalism often also referred to as register automata. Finite-memory automata use a sequence of 'windows' to store values, similar to registers in a register automaton. At each transition, the input symbol is compared to all the values currently stored in the windows.

To improve the formalism, Benedikt et al. [15] do not store all data values in their deterministic finite-memory automata, only those that are needed in later comparisons. Such values are called memorable. This feature lets them use potentially fewer registers than Kaminski and Francez's finite-memory automata, while keeping the same expressive power. For example, consider the data language over a set A = {a, b} of actions and the domain of integers, where words are in the language if the data value in a b-symbol is equal to the data value in the previous a-symbol. The word a(1)b(1) is in this language, but the word a(1)b(2) is not. This means that the value 1 is memorable and should be stored by the automaton after reading a, since there is a comparison between the data value in the b-symbol and the data value in the a-symbol. Deterministic finite-memory automata are canonical, due to a Myhill-Nerode-like theorem provided by the authors.

In both the finite-memory automata and the deterministic finite-memory automata models, the registers are ordered, and the stored data values are unique. These restrictions can, however, often cause the automata to become unnecessarily large. We discuss this in detail in Section 6 of Paper I, but here give an idea of the implications.

The order on registers means that (assuming registers are ordered x1 . . . xn) whenever a register xi is assigned a value on some transition in the automaton, then the register xi−1 must have been assigned a value on some preceding transition. For example, we consider the case where data symbols can carry multiple data values, i.e., are of the form α(d1, . . . , dk) for some k. With an order on registers, we get k! different possible ways to store the values. In contrast, our register automata do not have an order on registers, so we need (at most) k registers in this case.

The requirement to store unique data values means that no more than one instance of the same data value can be stored in registers (even if there are several in a data word). With this restriction, a data word α1(d1) . . . αm(dm) where di = dj will not use the same path in the automaton as a data word α1(d′1) . . . αm(d′m) with d′i ≠ d′j, since in the first case, only one register can be assigned and compared to data values in future transitions, whereas in the second case, two registers are available. In contrast, our register automata do not require stored data values to be unique, which means that both words can use the same path in the automaton.

Decidability for finite-memory automata and connections to different logics have been investigated, e.g., in [25, 37, 39, 62, 65]. In [50], finite-memory automata were extended to the setting where registers can be updated nondeterministically during a run of the automaton.

Data automata.
Bojanczyk et al. introduce data automata [19, 20, 21] and reason about them from an algebraic perspective. In [17], Björklund and Schwentick present the expressively equivalent class-memory automata which, unlike data automata, can be made deterministic.

A data automaton considers two components of a data word α1(d1) . . . αn(dn): its sequence α1 . . . αn of actions and its classes, i.e., the sets of positions in the word that have equal data values. To do this, it is essentially composed of two automata: a base automaton and a class automaton. The base automaton processes the sequence of actions in a word and accepts or rejects it. It compares data values for equality, and produces class strings, i.e., sequences of actions whose data values are equal. For example, a base automaton that accepts the data word a(20)b(43)b(20) can produce the output class strings ab and b.


The class automaton recognizes the regular language that consists of the set of output class strings produced by the base automaton. In this manner, the data automaton as a whole can be used to recognize data languages where data values are compared for equality, but it can also recognize languages where all data values are inequal: this simply corresponds to the regular language over class strings of length 1. Register automata cannot recognize these languages since that would require storing a possibly infinite number of data values in registers. Data automata and class-memory automata are thus strictly more expressive than register automata.

Symbolic automata.
A formalism that corresponds to register automata over different theories is that of symbolic automata [35, 57, 44]. Symbolic automata, however, do not use registers. They process words over infinite alphabets by using constraints over data symbols, so transitions are labeled with sets of input symbols, similar to guards in our register automata. For example, a symbolic automaton's behavior may be determined by whether an input symbol is within a certain interval or not. This corresponds to a register automaton for the theory of inequality (<).

Symbolic transducers [71] are very similar to symbolic automata, but augmented with registers, which allows for more succinct models. They have been used, e.g., for processing regular expressions. In [35], the authors describe some minimization algorithms for symbolic transducers. However, the work on symbolic transducers has rather focused on minimization and equivalence checking algorithms, not on providing a Nerode congruence or any canonicity properties.

Automata with fresh data values.
A formalism that focuses on recognizing 'fresh' data values is fresh-register automata [69]. Fresh data values, in this sense, are data values that have not been seen yet. A locally fresh data value is one that is not currently stored in a register; a globally fresh data value is one that has not been stored in any register so far. Fresh-register automata can capture both local freshness and global freshness. For example, a fresh-register automaton can recognize the data language of words α1(d1) . . . αk(dk) where di ≠ dj for any pair of data values with i ≠ j. This language cannot be recognized by a register automaton.

Finite-memory automata and our register automata are a less expressive subclass of fresh-register automata, since they can capture local freshness but not global freshness. Session automata [22, 23] are another less expressive subclass, which can capture global freshness but not local freshness. Session automata are thus incomparable to register automata.

Session automata can either store a fresh data value, or read a data value from a register. Similar to the representative data words we use in Paper III, Bollig et al. [23] define a symbolic normal form for data words, representing words that are indistinguishable by the equality relation and freshness. A data word in this form represents a unique set of concrete data words (similar to the symbolic automata). A set of words in symbolic normal form is a regular language if the words use only a bounded number of registers. Based on this, the authors define canonical session automata.

Another form of automata that is incomparable to our register automata is variable automata [42]. These are extensions of (nondeterministic) finite automata that can be used to model languages over infinite alphabets. They are strictly less expressive than nondeterministic finite-memory automata [50]. In a variable automaton, variables are only assigned once and cannot change their values during a run. A nondeterministic variable automaton can assign values to variables nondeterministically. Variable automata can be used to recognize, e.g., data languages over the theory of equality, or data languages where each alphabet symbol appears in each word. The authors of [42] propose variable automata as a simpler alternative to, e.g., register automata or pebble automata, since many constructions and algorithms for nondeterministic finite automata also apply to variable automata.

6.2 Learning extended finite state machines with L∗

In Chapter 4, we described the most common algorithm for inferring finite state machines using active learning: L∗. In this section, we describe different tools and frameworks that make use of the algorithm in order to infer extended finite state machine models of components, programs, or interfaces. Most approaches are white-box, meaning that domain knowledge, manual abstractions, and symbolic execution can all be used, since the source code of a target component is accessible. Other approaches (including our own) are black-box, meaning that the target component is presumed to be (almost) completely unknown and the source code is unavailable. In black-box settings, we only have access to a component's interface.

Black-box methods.
An early method for inferring Mealy machines with comparisons for equality is presented by Berg et al. [16]. The authors generalize the L∗ algorithm to be able to infer symbolic Mealy machines, i.e., Mealy machines where transitions are labeled with parameterized symbols, guards, and assignments, similar to those in our register automata. Their approach is to first infer a finite-state Mealy machine where data values are taken from a small domain, and then fold this into a symbolic Mealy machine. This differs from SL∗ in two main respects: first, in that the inferred models are forms of Mealy machines, i.e., models with non-binary output; second, in that their algorithm produces an intermediate model before producing the final one.


Aarts et al. have developed a technique [1, 6] for inferring finite-state models of real components or programs using an approach based on the L∗ algorithm. The technique has been implemented in a tool called Tomte. A mapper is placed as an extra layer between the learning algorithm and the target component. In principle, the mapper allows Tomte to infer models where data values are compared using different relations, as long as these are defined in the mapper. The mapper thus provides a finite-state abstraction of an infinite-state system.

With manually constructed mappers that compare data values for equality, Tomte has been used to infer models of protocol entities [7], the biometric passport [8], and models of bank cards [2]. Tomte can also be used in a black-box setting, by employing an automatically generated mapper [4]. In these cases, input values can only be tested for equality. This version of Tomte was recently [3] extended with support for inferring 'fresh' data values in a component's output, i.e., values that have not been previously processed by the component as input.

Maler and Mens [57] develop a symbolic version of the L∗ algorithm in order to learn symbolic automata. As in the SL∗ algorithm, counterexamples do not necessarily trigger the addition of a new state, but a counterexample can also be used to modify the boundaries of the subsets of the alphabet that are used as transition labels. The authors' results indicate that for very large alphabets, their algorithm is more efficient than L∗; we noted a similar result when comparing the performance of register automata learning to L∗ in Paper II.

Bollig et al. [23] present a learning algorithm that infers the canonical session automaton for a given data language, based on the fact that the language recognized by a session automaton can be represented as a regular set of symbolic data words. They infer session automata using a version of Rivest and Schapire's optimization of L∗.

White-box methods.
In the Sigma∗ framework [24], the L∗ algorithm is used to infer symbolic lookback transducers (SLTs). An SLT is a limited form of symbolic transducer, where the generated output may only depend on the k last seen input symbols. Since the framework is white-box, symbolic execution is used to answer membership queries and to discover the input alphabet. The Teacher maintains an abstraction (a separate SLT) representing an overapproximation of the target component or program, which is used to answer equivalence queries: the abstraction is checked for equivalence against the Learner's hypothesis. If a counterexample is produced, this either means that the hypothesis is wrong or that the abstraction must be refined. Sigma∗ has been used, e.g., for verifying web sanitizers. The limitation of Sigma∗ is that it only works in scenarios where it is sufficient to consider the k most recent input values.

Psyco [40] is a white-box framework that combines an implementation of L∗ with symbolic execution for generating temporal component interfaces. The authors use L∗ to generate guards over method parameters and to determine when existing guards must be refined. They use three regular languages to represent legal/illegal/unknown method traces, and keep track of current method guards in a map. To answer a membership query, the Teacher exercises all possible paths through each method in the query, by satisfying the guards, answering true/false/unknown depending on whether the particular path is legal, illegal, or unknown. The three languages are modeled as a three-valued finite automaton that captures the ordering between method invocations, i.e., it represents which sequences of method invocations are legal and which are not.

Xiao et al. [73] infer finite-state models of Java classes using an extended version of the L∗ algorithm. Transitions are annotated with method calls and guards over data fields taken directly from the inferred Java classes. Guards are refined whenever the output generated by a particular method depends on some data parameter, indicating that this relation also must be modeled. The refinement is done using binary classification with support vector machines (SVMs). The authors use a random walk algorithm for simulating equivalence queries through testing.

Grinchtein et al. [41] extend the L∗ algorithm to the setting of timed automata, learning event-recording automata.


7. Conclusions

In this thesis, we have presented work in two main areas: automata theory, and algorithms for automata learning. Here, we give concluding remarks and indicate some directions for future work.

Automata theory
In this area, our main goals were to adapt register automata to recognize a wider class of languages (i.e., by not restricting relations on data values to just equality), and to obtain canonicity by means of a Nerode congruence for our register automata.

Paper I introduced a register automaton model that recognizes data languages where data values can be compared for equality. It is equivalent, in terms of expressivity, to other similar formalisms. The advantages of our register automata are that they are canonical, and that they can be exponentially more succinct compared to other formalisms. In the paper, we showed that our models are minimal under certain restrictions. An important contribution in this paper is a Nerode equivalence and a Myhill-Nerode theorem for the class of data languages over the theory of equality.

The automaton model in Paper III is a generalization of that in Paper I in that it is parameterized on a theory, i.e., a set of binary relations together with a data domain. (The model in Paper I corresponds to our register automaton model for the theory of equality over some data domain.) An important contribution in this paper is a Nerode equivalence and a Myhill-Nerode theorem for data languages over certain theories, e.g., equality on any domain, and inequality (<) on certain domains (such as the real numbers).

Algorithms for automata learning
In this area, our main goal was to adapt an active automata learning algorithm (L∗) to learning register automata. The first sub-goal was to be able to infer the models in Paper I, where data values can only be compared for equality. Then, we wanted to extend the approach to also infer the models in Paper III, where data values can be compared using binary relations. In addition, we wanted to provide a tool that would allow us to learn real programs using our learning algorithms.

In Paper II, we extended the L∗ algorithm to be able to infer the register automata models presented in Paper I, i.e., where data values can be compared for equality. We implemented our algorithm in LearnLib and used a small example to compare its performance to classic L∗ with an abstraction on data values. The results were positive, and showed that with our approach, far fewer states were needed. An important contribution in this paper is that it presents the first active automata learning algorithm able to infer register automata using an L∗-like approach.

Paper III presents a generalization (SL∗) of the algorithm in Paper II to learning register automata that are parameterized on a theory, i.e., the models also introduced in Paper III. The theoretical basis for the new algorithm is our symbolic Nerode congruence and Myhill-Nerode theorem. We implemented this algorithm prototypically in LearnLib and illustrated its versatility using a series of small examples that showcased how it could be used to infer register automata over different theories. An important contribution in this paper is that it presents the first active automata learning algorithm able to infer register automata where data values are compared using relations other than just equality.

Using automata learning in practice.
In Paper IV, we introduced RALib, an extension to the LearnLib framework for active automata learning. RALib makes the SL∗ algorithm more useful in realistic scenarios, by adding several new features: the ability to infer models with input and output, typed data parameters, and mixing several theories in the same model. RALib also adds functionality for inferring models with 'fresh' (i.e., previously unseen) data parameters for data languages over the theory of equality. With RALib's built-in Java class analyzer, users can directly infer models of Java classes.

We compared RALib to other similar tools, and showed that RALib is often more efficient in terms of performance, by using fewer queries to infer the same models. RALib also provides more features: e.g., other tools only learn data languages over the theory of equality, and none of them can handle data types.

7.1 Future work

There are some directions for future work that we would like to explore, and we outline them here.

The register automata formalisms we have defined (Papers I and III) could be optimized for different purposes. For example, when modeling the behavior of a container or data structure of size n, our automata will need n registers. Sometimes, however, we are only interested in a subset of a component's behavior, such as popping an element from a nonempty stack. In this case, it does not matter how many values are in the stack, only that there is at least one, so it would make sense not to have to store all the values in registers when modeling the stack as a register automaton. We would like to investigate how we could amend our register automaton model to take into account these types of behaviors. Moreover, we would like to extend automata learning to these new models.

The SL∗ algorithm (Paper III) could be optimized in several ways. One possible optimization would be to investigate the trade-off between membership/tree queries and equivalence queries. Typically, fewer membership/tree queries result in intermediate hypotheses that are not very accurate, so the algorithm needs more equivalence queries. Since equivalence queries are more difficult to implement than membership/tree queries, it might be interesting to try to shift the balance towards more accurate intermediate hypotheses. We already exploit counterexamples quite efficiently by reusing the same one several times whenever possible, but we could try to find better counterexamples, i.e., ones that expose as many discrepancies as possible between the hypothesis and the target language.

The symbolic decision trees and tree oracles could probably be used more efficiently. Currently, the tree oracle generates symbolic decision trees that describe a fragment of a data language after a particular prefix. The trees are used primarily in three ways: they are compared to other trees for isomorphism under some renaming, they are refined (i.e., more branches are added), and they form the basis for constructing register automata. However, it is not always the case that we need all branches and nodes in a symbolic decision tree for determining whether it is equivalent to another tree, or for using it to generate a register automaton. Sometimes, two trees only differ in one or two branches or leaf nodes. In these cases, it would probably be more efficient to just compare these parts of the trees in order to see whether they are equivalent. We would like to investigate whether it is possible to improve the tree oracle so that it will only generate parts of a tree that are important or significant in some sense, or apply some sort of pruning strategy to the already generated trees.

Finally, there are some obvious ways to extend the RALib tool (Paper IV). One direction would be to apply it to more case studies, particularly larger and more complex components, to study the performance of the tool. Another would be to automate the choice of parameter types and theories in a model. Currently, these are determined by the user based, in some sense, on prior knowledge of the target component. To improve RALib, we would want to investigate different ways of automatically inferring what theories are needed to model a particular target component. This could perhaps be achieved by some intelligent test case selection, or by using counterexample-driven approaches to find an appropriate theory.


References

[1] F. Aarts. Tomte: Bridging the Gap between Active Learning and Real-World Systems. PhD thesis, Radboud University Nijmegen, 2014.
[2] F. Aarts, J. de Ruiter, and E. Poll. Formal models of bank cards for free. In Proc. ICSTW 2013, pages 461–468, 2013.
[3] F. Aarts, P. Fiterau-Brostean, H. Kuppens, and F. W. Vaandrager. Learning register automata with fresh value generation. In Theoretical Aspects of Computing - ICTAC 2015 - 12th International Colloquium, Cali, Colombia, October 29-31, 2015, Proceedings, pages 165–183, 2015.
[4] F. Aarts, F. Heidarian, H. Kuppens, P. Olsen, and F. W. Vaandrager. Automata learning through counterexample guided abstraction refinement. In Proc. FM 2012, volume 7436 of LNCS, pages 10–27. Springer, 2012.
[5] F. Aarts, F. Howar, H. Kuppens, and F. W. Vaandrager. Algorithms for inferring register automata - a comparison of existing approaches. In Proc. ISoLA 2014, Part I, pages 202–219, 2014.
[6] F. Aarts, B. Jonsson, J. Uijen, and F. Vaandrager. Generating models of infinite-state communication protocols using regular inference with abstraction. Formal Methods in System Design, pages 1–41, 2014.
[7] F. Aarts, H. Kuppens, J. Tretmans, F. W. Vaandrager, and S. Verwer. Learning and testing the bounded retransmission protocol. In Proc. ICGI 2012, pages 4–18, 2012.
[8] F. Aarts, J. Schmaltz, and F. W. Vaandrager. Inference and abstraction of the biometric passport. In Proc. ISoLA 2010, Part I, pages 673–686, 2010.
[9] R. Alur and D. L. Dill. A theory of timed automata. Theor. Comput. Sci., 126(2):183–235, Apr. 1994.
[10] T. Andrews, S. Qadeer, S. K. Rajamani, J. Rehof, and Y. Xie. Zing: A model checker for concurrent software. In Computer Aided Verification, 16th International Conference, CAV 2004, Boston, MA, USA, July 13-17, 2004, Proceedings, pages 484–487, 2004.
[11] T. Andrews, S. Qadeer, S. K. Rajamani, and Y. Xie. Zing: Exploiting program structure for model checking concurrent software. In CONCUR 2004 - Concurrency Theory, 15th International Conference, London, UK, August 31 - September 3, 2004, Proceedings, pages 1–15, 2004.
[12] D. Angluin. Learning regular sets from queries and counterexamples. Inf. Comput., 75(2):87–106, 1987.
[13] C. Baier and J. Katoen. Principles of model checking. MIT Press, 2008.
[14] M. Barnett, B.-Y. E. Chang, R. DeLine, B. Jacobs, and K. R. Leino. Boogie: A modular reusable verifier for object-oriented programs. In Formal Methods for Components and Objects: 4th International Symposium, FMCO 2005, volume 4111 of Lecture Notes in Computer Science, pages 364–387. Springer, 2006.
[15] M. Benedikt, C. Ley, and G. Puppis. What you must remember when processing data words. In AMW, 2010.
[16] T. Berg, B. Jonsson, and H. Raffelt. Regular inference for state machines using domains with equality tests. In Proc. FASE, volume 4961 of LNCS, pages 317–331, 2008.
[17] H. Björklund and T. Schwentick. On notions of regularity for data languages. In E. Csuhaj-Varjú and Z. Ésik, editors, Fundamentals of Computation Theory, volume 4639 of Lecture Notes in Computer Science, pages 88–99. Springer Berlin Heidelberg, 2007.
[18] T. Bohlin. Regular Inference for Communication Protocol Entities. PhD thesis, Uppsala University, Division of Computer Systems, 2009.
[19] M. Bojanczyk, C. David, A. Muscholl, T. Schwentick, and L. Segoufin. Two-variable logic on data words. ACM Trans. Comput. Logic, 12(4):27:1–27:26, July 2011.
[20] M. Bojanczyk, B. Klin, and S. Lasota. Automata with group actions. In Logic in Computer Science (LICS), 2011 26th Annual IEEE Symposium on, pages 355–364, June 2011.
[21] M. Bojanczyk, A. Muscholl, T. Schwentick, L. Segoufin, and C. David. Two-variable logic on words with data. In 21st IEEE Symposium on Logic in Computer Science (LICS 2006), 12-15 August 2006, Seattle, WA, USA, Proceedings, pages 7–16, 2006.
[22] B. Bollig, P. Habermehl, M. Leucker, and B. Monmege. A fresh approach to learning register automata. In Proc. DLT 2013, volume 7907, pages 118–130, 2013.
[23] B. Bollig, P. Habermehl, M. Leucker, and B. Monmege. A robust class of data languages and an application to learning. Logical Methods in Computer Science, 10(4), 2014.
[24] M. Botincan and D. Babic. Sigma*: symbolic learning of input-output specifications. In Proc. POPL 2013, pages 443–456. ACM, 2013.
[25] P. Bouyer. A logical characterization of data languages. Inf. Process. Lett., 84(2):75–85, Oct. 2002.
[26] G. Brat, K. Havelund, S. Park, and W. Visser. Model checking programs. In IEEE International Conference on Automated Software Engineering (ASE), pages 3–12. Citeseer, 2000.
[27] M. Broy, B. Jonsson, J.-P. Katoen, M. Leucker, and A. Pretschner, editors. Model-Based Testing of Reactive Systems, volume 3472 of LNCS. Springer, 2004.
[28] S. Cassel, F. Howar, B. Jonsson, M. Merten, and B. Steffen. A succinct canonical register automaton model. In Proc. ATVA 2011, pages 366–380, 2011.
[29] S. Cassel, F. Howar, B. Jonsson, and B. Steffen. Learning extended finite state machines. In Proc. SEFM 2014, pages 250–264, 2014.
[30] S. Cassel, B. Jonsson, F. Howar, and B. Steffen. A succinct canonical register automaton model for data domains with binary relations. In Proc. ATVA 2012, pages 57–71, 2012.
[31] C. Y. Cho, D. Babic, P. Poosankam, K. Z. Chen, E. X. Wu, and D. Song. MACE: Model-inference-assisted concolic exploration for protocol and vulnerability discovery. In USENIX Security Symposium, 2011.
[32] C. Y. Cho, D. Babic, E. C. R. Shin, and D. Song. Inference and analysis of formal models of botnet command and control protocols. In Proceedings of the 17th ACM Conference on Computer and Communications Security, CCS 2010, Chicago, Illinois, USA, October 4-8, 2010, pages 426–439, 2010.
[33] E. M. Clarke, O. Grumberg, and D. Peled. Model checking. MIT Press, 2001.
[34] W. Cui, J. Kannan, and H. J. Wang. Discoverer: Automatic protocol reverse engineering from network traces. In USENIX Security, pages 199–212, 2007.
[35] L. D'Antoni and M. Veanes. Minimization of symbolic automata. In Proc. POPL 2014, pages 541–553, New York, NY, USA, 2014. ACM.
[36] C. De La Higuera. A bibliographical study of grammatical inference. Pattern Recognition, 38(9):1332–1348, 2005.
[37] S. Demri and R. Lazic. LTL with the freeze quantifier and register automata. ACM Trans. Comput. Logic, 10(3):16:1–16:30, Apr. 2009.
[38] J. Engelfriet and H. Hoogeboom. Tree-walking pebble automata. In J. Karhumäki, H. Maurer, G. Paun, and G. Rozenberg, editors, Jewels are Forever, pages 72–83. Springer Berlin Heidelberg, 1999.
[39] D. Figueira, P. Hofman, and S. Lasota. Relating timed and register automata. In Proc. EXPRESS'10, volume 41 of EPTCS, pages 61–75, 2010.
[40] D. Giannakopoulou, Z. Rakamaric, and V. Raman. Symbolic learning of component interfaces. In Proc. SAS 2012, volume 7460 of LNCS, pages 248–264. Springer, 2012.
[41] O. Grinchtein, B. Jonsson, and M. Leucker. Learning of event-recording automata. Theor. Comput. Sci., 411(47):4029–4054, 2010.
[42] O. Grumberg, O. Kupferman, and S. Sheinvald. Variable automata over infinite alphabets. In Proc. LATA 2010, volume 6031 of LNCS, pages 561–572. Springer, 2010.
[43] G. Holzmann. The Spin Model Checker: Primer and Reference Manual. Addison-Wesley Professional, first edition, 2003.
[44] P. Hooimeijer and M. Veanes. An evaluation of automata algorithms for string analysis. In Proceedings of the 12th International Conference on Verification, Model Checking, and Abstract Interpretation, VMCAI'11, pages 248–262, Berlin, Heidelberg, 2011. Springer-Verlag.
[45] J. E. Hopcroft, R. Motwani, and J. D. Ullman. Introduction to Automata Theory, Languages, and Computation (2nd ed.). Addison-Wesley series in computer science. Addison-Wesley-Longman, 2001.
[46] F. Howar, M. Isberner, B. Steffen, O. Bauer, and B. Jonsson. Inferring semantic interfaces of data structures. In Proc. ISoLA 2012, Part I, pages 554–571, 2012.
[47] M. Isberner, F. Howar, and B. Steffen. The open-source LearnLib - a framework for active automata learning. In Computer Aided Verification - 27th International Conference, CAV 2015, San Francisco, CA, USA, July 18-24, 2015, Proceedings, Part I, pages 487–495, 2015.
[48] R. Jetley, S. P. Iyer, and P. L. Jones. A formal methods approach to medical device review. Computer, 39(4):61–67, 2006.
[49] M. Kaminski and N. Francez. Finite-memory automata. Theor. Comput. Sci., 134(2):329–363, Nov. 1994.
[50] M. Kaminski and D. Zeitlin. Finite-memory automata with non-deterministic reassignment. International Journal of Foundations of Computer Science, 21(05):741–760, 2010.
[51] D. Kozen. Automata and computability. Undergraduate texts in computer science. Springer, 1997.
[52] M. Z. Kwiatkowska, G. Norman, and D. Parker. Probabilistic model checking in practice: case studies with PRISM. SIGMETRICS Performance Evaluation Review, 32(4):16–21, 2005.
[53] M. Z. Kwiatkowska, G. Norman, and D. Parker. PRISM 4.0: Verification of probabilistic real-time systems. In Computer Aided Verification - 23rd International Conference, CAV 2011, Snowbird, UT, USA, July 14-20, 2011. Proceedings, pages 585–591, 2011.
[54] F. Laroussinie and K. G. Larsen. CMC: A tool for compositional model-checking of real-time systems. In Formal Description Techniques and Protocol Specification, Testing and Verification, FORTE XI / PSTV XVIII'98, IFIP TC6 WG6.1 Joint International Conference on Formal Description Techniques for Distributed Systems and Communication Protocols (FORTE XI) and Protocol Specification, Testing and Verification (PSTV XVIII), 3-6 November, 1998, Paris, France, pages 439–456, 1998.
[55] K. G. Larsen, P. Pettersson, and W. Yi. Uppaal in a nutshell. International Journal on Software Tools for Technology Transfer, 1(1-2):134–152, 1997.
[56] B. Livshits. Improving Software Security with Precise Static and Runtime Analysis. PhD thesis, Stanford University, Stanford, California, 2006.
[57] O. Maler and I. Mens. Learning regular languages over large alphabets. In Proc. TACAS 2014, pages 485–499, 2014.
[58] L. Mariani, F. Pastore, and M. Pezzè. Dynamic analysis for diagnosing integration faults. IEEE Trans. Software Eng., 37(4):486–508, 2011.
[59] M. Merten, B. Steffen, F. Howar, and T. Margaria. Next generation LearnLib. In Proc. TACAS 2011, pages 220–223, 2011.
[60] J. Myhill. Finite automata and the representation of events. In WADD TR-57-624, pages 112–137. Wright Patterson AFB, Ohio, 1957.
[61] A. Nerode. Linear automaton transformations. Proceedings of the American Mathematical Society, 9(4):541–544, 1958.
[62] F. Neven, T. Schwentick, and V. Vianu. Finite state machines for strings over infinite alphabets. ACM Trans. Comput. Logic, 5(3):403–435, July 2004.
[63] H. Raffelt, B. Steffen, T. Berg, and T. Margaria. LearnLib: a framework for extrapolating behavioral models. STTT, 11(5):393–407, 2009.
[64] R. L. Rivest and R. E. Schapire. Inference of finite automata using homing sequences. Inf. Comput., 103(2):299–347, 1993.
[65] H. Sakamoto and D. Ikeda. Intractability of decision problems for finite-memory automata. Theoretical Computer Science, 231(2):297–308, 2000.
[66] D. C. Schmidt. Guest editor's introduction: Model-driven engineering. Computer, 39(2):25–31, 2006.
[67] L. Segoufin. Automata and logics for words and trees over an infinite alphabet. In CSL, pages 41–57, 2006.
[68] G. Shu and D. Lee. Testing security properties of protocol implementations - a machine learning based approach. In Proc. ICDCS 2007, page 25. IEEE, 2007.
[69] N. Tzevelekos. Fresh-register automata. In Proceedings of the 38th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2011, Austin, TX, USA, January 26-28, 2011, pages 295–306, 2011.
[70] M. Utting and B. Legeard. Practical Model-Based Testing - A Tools Approach. Morgan Kaufmann, 2007.
[71] M. Veanes, P. Hooimeijer, B. Livshits, D. Molnar, and N. Bjørner. Symbolic finite state transducers: algorithms and applications. In Proc. POPL 2012, pages 137–150, 2012.
[72] N. Walkinshaw, K. Bogdanov, M. Holcombe, and S. Salahuddin. Reverse engineering state machines by interactive grammar inference. In 14th Working Conference on Reverse Engineering (WCRE 2007), 28-31 October 2007, Vancouver, BC, Canada, pages 209–218, 2007.
[73] H. Xiao, J. Sun, Y. Liu, S.-W. Lin, and C. Sun. TzuYu: Learning stateful typestates. In Proc. ASE 2013, pages 432–442. IEEE, 2013.