

Demonstrating the Feasibility of Automatic Game Balancing

Vanessa Volz Günter Rudolph Boris Naujoks

Abstract

Game balancing is an important part of the (computer) game design process, in which designers adapt a game prototype so that the resulting gameplay is as entertaining as possible. In industry, the evaluation of a game is often based on costly playtests with human players. This suggests automating the process using surrogate models for the prediction of gameplay and outcome. In this paper, the feasibility of automatic balancing using simulation- and deck-based objectives is investigated for the card game top trumps. Additionally, the necessity of a multi-objective approach is asserted by a comparison with the only known (single-objective) method. We apply a multi-objective evolutionary algorithm to obtain decks that optimise objectives, e.g. win rate and average number of tricks, developed to express the fairness and the excitement of a game of top trumps. The results are compared with published top trumps decks using simulation-based objectives. We demonstrate that it is possible to generate decks that are better than, or at least as good as, published decks in terms of these objectives. Our results indicate that automatic balancing with the presented approach is feasible even for more complex games such as real-time strategy games.

I. Introduction

The increasing complexity and popularity of (computer) games result in numerous challenges for game designers. Especially fine-tuning game mechanics, which affects the feel and required skill profile of a game significantly, is a difficult task. For example, changing the time between shots for the sniper rifle in Halo 3 from 0.5 to 0.7 seconds impacted the gameplay significantly according to designer Jaime Griesemer¹.

It is important to draw attention to the fact that the game designer's vision of a game can rarely be condensed into just one intended game characteristic. In competitive games, for example, it is certainly important to consider fairness, meaning that the game outcome depends on skill rather than luck (skill-based) and that the win rate of two equally matched players is approx. 50% (unbiased). But additionally, the outcome should not be deterministic, and the game should entail exciting gameplay, possibly favouring tight outcomes.

It therefore suggests itself to support the balancing process with tools that can automatically evaluate and suggest different game parameter configurations which fulfil a set of predefined goals (cf. [14]). However, since the effects of certain goals tend to be obscure at the time of design, we suggest using a multi-objective approach, which allows the decision on a configuration to be postponed until the trade-offs can be observed. In this paper, we introduce game balancing as a multi-objective optimisation problem and demonstrate the feasibility of automating the process in a case study.

For the purpose of this paper, we define game balancing as the modification of parameters of the constitutive and operational rules of a game (i.e. the underlying physics and the induced consequences / feedback) in order to achieve optimal configurations in terms of a set of goals. To this end, we analyse the card game top trumps and different approaches to balance it automatically, demonstrating the feasibility and advantages of a multi-objective approach as well as possibilities to introduce surrogate models.

¹ http://www.gdcvault.com/play/1012211/Design-in-Detail-Changing-the


In the following section, we present related work on top trumps, balancing for multiplayer competitive games, and gameplay evaluations. The subsequent section highlights some important concepts specific to the game top trumps and multi-objective optimisation, before the following section details our research approach including the research questions posed. Afterwards, the results of our analysis are presented and discussed, before we finish with a conclusion and outlook on future work.

II. Related Work

Cardona et al. use an evolutionary algorithm to select cards for top trumps games from open data [4]. The focus of their research, however, is the potential to teach players about data and learn about it using games. The authors develop and use a single-objective dominance-related measure to evaluate the balance of a given deck. This measure is used as a reference in this paper (cf. fD in Sec. IV).

Jaffe introduces a technique called restricted play that is supposed to enable designers to express balancing goals in terms of the win rate of a suitably restricted agent [11]. However, this approach necessitates expert knowledge about the game as well as an AI and several potentially computationally expensive simulations. In contrast, we explore other possibilities to express design goals and utilise non-simulation-based metrics.

Chen et al. intend to solve “the balance problem of massively multiplayer online role-playing games using co-evolutionary programming” [5]. However, they focus on level progression and ignore any balancing concerns apart from equalising the win rates of different in-game characters.

Yet, most work involving the evaluation of a game configuration is related to procedural content generation, specifically map or level generation. Several papers focus on issuing guarantees, e.g. with regards to playability [18], solvability [17], or diversity [15, 10]. Other research areas include dynamic difficulty adaptation for single-player games [9], the generation of rules [2, 16], and more interactive versions of game design, e.g. mixed-initiative tools [13].

III. Basics

In the following, the game top trumps is introduced and the theoretical background for the applied methods from multi-objective optimisation and performance evaluation is summarised.

I. Top Trumps

Top trumps is a themed card game originally published in the 1970s and relaunched in 1999. Popular themes include cars, motorcycles, and aircraft. Each card in the deck corresponds to a specific member of the theme (such as a car model for cars) and displays several of its characteristics, such as acceleration, cubic capacity, performance, or top speed. An example can be found in Fig. 1.

Figure 1: Card from a train-themed top trumps deck with 5 categories (winningmoves.co.uk)

At the start of a game, the deck is shuffled and distributed evenly among players. The starting player chooses a characteristic whose value is then compared to the corresponding values on the cards of the remaining players. The player with the highest value receives all cards in the trick and then continues the game by selecting a new attribute from their next card. The game usually ends when at least one player has lost all their cards. However, for the purpose of this paper, we end the game after all cards have been played once in order to avoid possible issues of non-ending games.
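To make the simplified rules concrete, the following minimal sketch simulates one two-player game under this convention; the function names and the tie handling (drawn tricks are simply discarded) are our own illustration, not code from the study.

```python
import random

def play_game(deck, choose_category, rng=random.Random(0)):
    """Simulate one two-player game in which every card is played once.

    deck: list of cards, each a tuple of L category values.
    choose_category: strategy mapping a card to a category index.
    Returns the number of tricks won by each player.
    """
    cards = list(deck)
    rng.shuffle(cards)
    hands = [cards[0::2], cards[1::2]]         # deal alternately to two players
    leader, tricks = 0, [0, 0]
    for rnd in range(len(hands[0])):
        played = [hands[0][rnd], hands[1][rnd]]
        cat = choose_category(played[leader])  # the leader announces a category
        if played[0][cat] == played[1][cat]:
            continue                           # simplification: drawn trick discarded
        winner = 0 if played[0][cat] > played[1][cat] else 1
        tricks[winner] += 1
        leader = winner                        # the trick winner announces next
    return tricks
```

A naive strategy such as `lambda card: max(range(len(card)), key=card.__getitem__)` can stand in for the agents described in Sec. IV.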

II. Multi-objective optimisation

Switching from single- to multi-objective optimisation has advantages and disadvantages [6]. While it is often better to consider multiple objectives for technical optimisation tasks, the complete order of individuals is lost in this case. Instead, one has to handle incomparable solutions, e.g. two solutions of which each is better than the other in at least one objective. This objective- or component-wise approach goes back to the definition of Pareto dominance. A solution or individual x is said to strictly (Pareto) dominate another solution y (denoted x ≺ y) iff x is better than y in all objectives. Considering minimisation, this reads

x ≺ y iff ∀i ∈ {1, . . . , m} : fi(x) < fi(y)

under the fitness function f : X ⊂ Rn → Rm, f(x) = (f1(x), . . . , fm(x)).

Based on this, the set of (Pareto) non-dominated solutions (Pareto set) is defined as the set of incomparable solutions as defined above, and the Pareto front is the image of the Pareto set under the fitness function f.
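These definitions translate directly into code; a minimal sketch for minimisation, using the strict all-objectives dominance defined above:

```python
def dominates(x, y):
    """x strictly Pareto-dominates y: x is better (smaller) in all objectives."""
    return all(xi < yi for xi, yi in zip(x, y))

def pareto_front(points):
    """The non-dominated subset of a list of objective vectors."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```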

Nevertheless, even incomparable solutions need to be distinguished when it comes to selection in an evolutionary algorithm. In the evolutionary multi-objective optimiser considered, SMS-EMOA [1], this is done based on the contribution to the hypervolume (i.e. the amount of objective space covered by a Pareto front w.r.t. a predefined reference point). The contribution of a single solution to the overall hypervolume of the front is used as the secondary ranking criterion for the (µ + 1)-approach to selection. The primary one is the non-dominated sorting rank assigned to each solution.
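For two objectives, the hypervolume contribution used in this selection step can be computed by sorting the front; the following is a minimal 2-D sketch under minimisation (practical SMS-EMOA implementations handle the general case), with names of our own choosing:

```python
def hv_contributions(front, ref):
    """Exclusive hypervolume contribution of each point of a 2-D
    non-dominated front w.r.t. a reference point (minimisation)."""
    pts = sorted(front)                  # ascending in f1, descending in f2
    contrib = {}
    for i, (x, y) in enumerate(pts):
        right = pts[i + 1][0] if i + 1 < len(pts) else ref[0]
        upper = pts[i - 1][1] if i > 0 else ref[1]
        contrib[(x, y)] = (right - x) * (upper - y)
    return contrib

# (mu + 1) selection: within the worst non-dominated sorting rank,
# the point with the smallest contribution is discarded.
```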

For measuring the performance of different SMS-EMOA runs, we consider two other performance indicators next to the hypervolume of the resulting Pareto fronts. These are the additive ε indicator as well as the R2 indicator, all presented by Knowles et al. [12]. These indicators are also considered for the termination of EMOA runs using online convergence detection as introduced by Trautmann et al. [19].

For variation in SMS-EMOA, the most widely used operators in the field are considered, namely simulated binary crossover and polynomial mutation. These are parametrised using pc = 1.0, pm = 1/n, ηc = 20.0, and ηm = 15.0, respectively, cf. Deb [6].
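In textbook form (cf. Deb [6]), the two operators look roughly as follows; this is a sketch, and production implementations usually add variable bounds to the crossover and per-variable crossover probabilities:

```python
import random

def sbx_pair(x1, x2, eta_c=20.0, pc=1.0, rng=random):
    """Simulated binary crossover on two parent vectors."""
    if rng.random() > pc:
        return x1[:], x2[:]
    c1, c2 = [], []
    for a, b in zip(x1, x2):
        u = rng.random()
        beta = ((2 * u) ** (1 / (eta_c + 1)) if u <= 0.5
                else (1 / (2 * (1 - u))) ** (1 / (eta_c + 1)))
        c1.append(0.5 * ((1 + beta) * a + (1 - beta) * b))
        c2.append(0.5 * ((1 - beta) * a + (1 + beta) * b))
    return c1, c2

def polynomial_mutation(x, low, high, eta_m=15.0, pm=None, rng=random):
    """Polynomial mutation; per-gene probability defaults to pm = 1/n."""
    pm = pm if pm is not None else 1 / len(x)
    y = x[:]
    for i in range(len(y)):
        if rng.random() < pm:
            u = rng.random()
            delta = ((2 * u) ** (1 / (eta_m + 1)) - 1 if u < 0.5
                     else 1 - (2 * (1 - u)) ** (1 / (eta_m + 1)))
            y[i] = min(max(y[i] + delta * (high - low), low), high)
    return y
```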

III. Performance Measurement for Stochastic Multi-objective Optimisers

The unary performance indicators introduced above only express the performance of a single optimisation run. However, to evaluate and compare the relative performance of stochastic optimisers with potentially significantly different outcomes (such as evolutionary multi-objective algorithms), measurements for the statistical performance are needed.

For this purpose, (empirical) attainment functions that describe the sets of goals achieved with different approaches were proposed, expressing both the quality of the achieved solutions as well as their spread over multiple optimisation runs [7]. Based on these functions, the set of goals that are reached in 50% (or other quantiles) of the runs of the optimisers can be computed (also known as the 50%-attainment surface). Comparing the attainment surfaces of different optimisers is already a much better indicator for their performances than comparing the best solutions achieved, as the latter are subject to stochastic influences.
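The attainment check itself is simple to state in code; a sketch for minimisation, where a goal counts as attained by a run if some solution of that run is at least as good in every objective:

```python
def attains(front, goal):
    """True if some point of the run's front weakly dominates the goal."""
    return any(all(p <= g for p, g in zip(point, goal)) for point in front)

def attainment_level(fronts, goal):
    """Empirical attainment function: fraction of runs attaining the goal."""
    return sum(attains(front, goal) for front in fronts) / len(fronts)

# Goals with attainment_level(fronts, goal) >= 0.5 lie on or behind
# the 50%-attainment surface.
```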

Additionally, Fonseca et al. detail a statistical testing procedure (akin to a two-sample two-sided Kolmogorov-Smirnov hypothesis test) based on the first- and second-order attainment functions of two optimisers [7]. If the null hypotheses of these tests are rejected, it can be assumed that the differences in performance of the considered optimisers are statistically significant.


IV. Approach

For the remainder of this paper, we denote the number of cards in a deck as K (an even number) and the number of characteristics (categories) displayed on a card as L. Two representations are used for a deck: (1) as a vector x ∈ RKL for the evolutionary algorithm and (2) as a K × L matrix V for easier understanding.

Accordingly, the value of the k-th card in the l-th category is vk,l with k ∈ {1, . . . , K}, l ∈ {1, . . . , L}. The k-th card in a deck is vk,· = (vk,1, . . . , vk,L). A partial order for the cards can be expressed with vk1,· ≺ vk2,·, meaning that card vk2,· beats vk1,· in all categories.

In this paper, we only consider decks that fulfil two basic requirements we deem essential for entertaining gameplay (a validity check for both requirements is sketched after this list):

• all cards in the deck are unique: ∄(k1, k2) ∈ {1, . . . , K}², k1 ≠ k2, with vk1,· = vk2,·

• there is no strictly dominant card in the deck: ∄k1 ∈ {1, . . . , K} with vk2,· ≺ vk1,· ∀k2 ∈ {1, . . . , K} (in this case, dominant cards have larger values, since higher values win according to the game rules)
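A minimal check of both requirements might look as follows (a sketch; V is a list of K cards, each a tuple of L values, and higher values win):

```python
def beats(a, b):
    """Card a beats card b in all categories (higher values win)."""
    return all(x > y for x, y in zip(a, b))

def valid_deck(V):
    unique = len(set(map(tuple, V))) == len(V)   # no duplicate cards
    no_dominant = not any(                       # no card beats all other cards
        all(beats(V[k1], V[k2]) for k2 in range(len(V)) if k2 != k1)
        for k1 in range(len(V)))
    return unique and no_dominant
```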

We consider two agents p4, p0 with different knowledge about the played deck:

• p4 knows the exact values of all cards in the deck

• p0 only knows the valid value range for all values vk,l

Both agents are able to perfectly remember which cards have been played already. Player p4 is expected to perform better than p0 on average on a balanced deck. In order to reduce the number of simulations needed to verify this, only games of player p4 against p0 will be considered here.

In our simulation, both agents estimate the probabilities to win with each category on a given card in consideration of their respective knowledge about the deck as well as the cards already played. p0 therefore has to assume a uniform distribution and will only take the values of their current card into account. p4, in contrast, is able to model the probability more precisely by accounting for the number of cards with a higher value in each category that are still in play.
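The two estimates could be sketched as follows; the continuous uniform assumption for p0 and the simple counting over unseen cards for p4 are our own simplifications of the behaviour described above, and ties are ignored:

```python
def p0_estimate(card, low=1.0, high=10.0):
    """p0: per category, chance that a uniformly random value is below ours."""
    return [(v - low) / (high - low) for v in card]

def p4_estimate(card, unseen):
    """p4: per category, fraction of unseen in-play cards with a lower value."""
    return [sum(c[l] < v for c in unseen) / len(unseen)
            for l, v in enumerate(card)]

# Either agent then announces the category with the highest estimate, e.g.
# cat = max(range(len(card)), key=p0_estimate(card).__getitem__)
```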

Let RG be the number of simulation runs. The number of tricks that p4 has received at the end of the r-th game (r ∈ {1, . . . , RG}) with deck V will be called t4(r,V) henceforth. Thus, iff t4(r,V) > K/2, p4 won the game; iff t4(r,V) = K/2, the game was a draw; and else, p4 lost. tc(r,V) is the number of times the player choosing the category did not win the trick in round r of the game with deck V, i.e. the number of times the player announcing the categories changed.

Since optimisation tasks are generally assumed to be minimisation problems (without loss of generality), this convention is adopted here as well for the sake of consistency. Therefore, maximisation problems are transformed into minimisation tasks by multiplication with −1. In the course of this paper, we compare 8 card sets from purchased decks with decks generated using three different approaches and corresponding fitness functions, detailed in the following (a combined sketch of all three fitness functions follows this list):

• Single-objective optimisation according to the dominance-related (D) measure proposed in [4], which describes the distance of the cards in a deck V to the Pareto front:

fD(V) = −(1/K) ∑_{k=1}^{K} ∑_{i=1, i≠k}^{K} (1 − 1(vk,· ≺ vi,·))

• Multi-objective optimisation with simulation-based measures developed with expert knowledge that are supposed to express the deck V's fairness, excitement, and resulting balance (B):

fB(V) = ( −(1/RG) ∑_{r=1}^{RG} 1(t4(r,V) > K/2),  −(1/RG) ∑_{r=1}^{RG} tc(r,V),  (1/RG) ∑_{r=1}^{RG} |2·t4(r,V) − K| / 2 )

• Multi-objective optimisation with simulation-independent measures developed in the pre-experimental planning phase (cf. Sec. II) as surrogate (S) for (the simulation-based) fitness fB of different decks:

fS(V) = ( −hv(V), −sd({avg(v·,l) | l ∈ {1, . . . , L}}) ),

with the dominated hypervolume hv of a deck V, sd the empirical standard deviation, and avg the average.
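The three fitness functions can be assembled as follows. This is a sketch: the hypervolume in fS is estimated here by Monte Carlo sampling rather than computed exactly, f_B consumes (t4, tc) pairs from simulated games as described above, and all helper names are our own.

```python
import random

def beats(a, b):
    """Card a beats card b in all categories (higher values win)."""
    return all(x > y for x, y in zip(a, b))

def f_D(V):
    """Dominance-related measure from [4]: per card, count the other cards
    that do not beat it; optimum -(K - 1) if no card is beaten by another."""
    K = len(V)
    free = sum(1 for k in range(K) for i in range(K)
               if i != k and not beats(V[i], V[k]))
    return -free / K

def f_B(results, K):
    """Simulation-based objectives from RG games; results = [(t4, tc), ...]."""
    RG = len(results)
    win_rate = sum(t4 > K / 2 for t4, _ in results) / RG  # fairness: p4 wins
    changes = sum(tc for _, tc in results) / RG           # excitement: lead changes
    margin = sum(abs(2 * t4 - K) / 2 for t4, _ in results) / RG  # tightness
    return (-win_rate, -changes, margin)                  # all to be minimised

def f_S(V, low=1.0, high=10.0, samples=50_000, rng=random.Random(1)):
    """Simulation-independent surrogate: deck hypervolume (Monte Carlo
    estimate w.r.t. the lower bound) and sd of the category means."""
    L = len(V[0])
    hit = 0
    for _ in range(samples):
        p = [rng.uniform(low, high) for _ in range(L)]
        if any(all(v >= x for v, x in zip(card, p)) for card in V):
            hit += 1
    hv = hit / samples * (high - low) ** L
    means = [sum(card[l] for card in V) / len(V) for l in range(L)]
    mu = sum(means) / L
    sd = (sum((m - mu) ** 2 for m in means) / (L - 1)) ** 0.5
    return (-hv, -sd)
```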

In order to compare the aforementioned approaches, an SMS-EMOA is used to approximate the Pareto front for the fitness functions fS and fB. Online convergence detection, variation operators, and parameters as described in Sec. II are used. For the single-objective fitness fD, the algorithm was modified as little as possible to enable comparisons. Thus, a (µ + 1)-EA was used with the same variation operators and equivalent selection. Convergence was tested based on the variation of some single-objective performance indicators, namely the min, mean, and max fitness values of the active population. The experiments were conducted using R with the help of the emoa package² and a related SMS-EMOA implementation³,⁴.

I. Research questions

The different approaches to finding a balanced top trumps deck are evaluated and compared in Sec. V. We focus on the following topics:

I Problems of manual balancing and the solutions offered by automation
II Feasibility of automatic balancing in terms of required quality
III Performance of multi- and single-objective approaches
IV Feasibility of automatic balancing in terms of computational costs

II. Pre-experimental planning

Before any experiments can be executed, the test case has to be defined more accurately. The following assumptions are made:

• The number of cards and categories are set to K = 32, L = 4 in accordance with these values for the purchased decks.

• The valid range of all values vk,l is set to [1, 10] ⊂ R, which all decks can be transformed to. This results in an infinite number of possible cards, but other options entail the necessity to construct a genotype-phenotype mapping.

Due to the large number of possible card distributions among the players, the order of the cards in a deck, and different starting players, a single deck could potentially result in a large number of different games (4K!). As a consequence, all simulation-based metrics to evaluate the deck have to be approximated. The values of the metrics in fB, which all express an average, should be as close to the true mean of the respective distributions as possible. To ensure the quality of the approximation, a statistical t-test is conducted to compute the size of the confidence interval for each metric for RG between 100 and 10 000 at a confidence level of 0.95. This test is repeated 500 times for each possible sample size and each metric. Assuming a normal distribution, the .95-quantile is stored as the result. RG = 2 000 games are found to be a good tradeoff between computational time and fitness approximation accuracy for all metrics.
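In outline, the sample-size study can be reproduced as follows (a sketch; the normal quantile approximates the t-quantile at these sample sizes, simulate_game stands in for the actual game simulation, and the tolerance value is illustrative):

```python
from statistics import NormalDist, stdev

def ci_width(samples, alpha=0.05):
    """Width of the confidence interval for the sample mean."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * z * stdev(samples) / len(samples) ** 0.5

def pick_RG(simulate_game, candidates=(100, 500, 1000, 2000, 5000, 10000),
            repeats=500, tolerance=0.05):
    """Smallest RG whose .95-quantile of CI widths stays below tolerance."""
    for rg in candidates:
        widths = sorted(ci_width([simulate_game() for _ in range(rg)])
                        for _ in range(repeats))
        if widths[int(0.95 * len(widths))] < tolerance:
            return rg
    return candidates[-1]
```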

An equivalent test is conducted to decide on the number of optimisation runs necessary to approximate the performance of the corresponding approach (to a suitable confidence interval). Here, the HV-, ε-, and R2-indicators are considered, with the Pareto front resulting from all of these runs as a reference set. After considering the results, the number of runs RO was set to RO = 100.

For the simulation-independent approach, metrics that do not require a simulation are developed. The hypervolume is chosen as a measure to achieve as many non-dominated cards in each deck as possible. This is expected to improve the fairness of a deck (also cf. fD). The standard deviation of category means is used to increase the significance of player p0's disadvantage, thereby resulting in a higher win rate for p4.

² https://cran.r-project.org/web/packages/emoa/
³ https://github.com/olafmersmann/emoa
⁴ Additional code written for the simulation and experiments will be made accessible after publication

Different population sizes are tested and the resulting Pareto fronts are compared for a quick estimate of their performance. Based on the results received, the approaches are evaluated on runs with population sizes of 10 and 100 individuals. This accounts for both small populations with a high selection pressure and bigger populations with a larger spread.

V. Results

We evaluate all approaches according to the fitness function fB, which is based on expert knowledge and is therefore assumed to characterise a balanced deck best. This assumption is supported by the fact that most of the purchased decks are located on the approximated Pareto front according to these metrics.

In the remainder of the paper, we use the letter corresponding to the fitness function and the population size to refer to the union of the Pareto fronts of RO = 100 runs with the respective fitness function and population size for the multi-objective approaches. The introduced acronym with an added index p refers to the Pareto front of the respective set with regards to fB. A numerical index stands for the attainment surfaces to the indicated level. For example, S10 is the union of all Pareto fronts from optimiser runs with population size 10 and fitness function fS, S10p is the Pareto front of this set, and S1050 is its 50%-attainment surface. For the single-objective approach, the union of the best individuals achieved in RO = 100 runs is considered instead, because the populations converge to one deck. The set of purchased decks will henceforth be denoted PD.

To facilitate the discussion of the experiments, the results of the three different approaches are plotted in terms of their performance on the fitness function fB. Figure 2 depicts the sets listed in Tab. 1. Figure 3 visualises the 50%-attainment surfaces as well as the Pareto fronts resulting from the Pareto front union for each approach. The legend for all plots can be found in Tab. 1, where the same colour scheme is used to refer to the Pareto fronts and attainment surfaces of the respective approaches.

Table 1: Legend providing the colour assignment to the results of different optimiser runs as depicted in Fig. 2 and Fig. 3. The same colour is assigned to all results from one approach, i.e. B10, B1050, and B10p have the same colour.

B10   D10   S10   PD
B100  D100  S100

I. Automatic Balancing Advantages

To evaluate the advantages of automatic balancing, the following hypotheses are proposed.

I-C1 The number of tests needed to approximate some simulation-based metrics for a single deck to an appropriate accuracy is very high and possibly exceeds the number of playtests that could reasonably be done with human players.

I-C2 Many of the purchased decks are unfair in the sense that the game's outcome depends strongly on luck and less on the players' skill levels.

With these hypotheses, the effort needed for manual balancing is considered and the performance of PD (i.e. likely manually balanced decks) is compared to that of automatically balanced decks.

The t-test described in Sec. II already determined that the best tradeoff between the accuracy of the approximation of simulation-based metrics and the number of simulations RG was ≈ 2 000. Considering the large effort playtests with humans necessitate, as well as the bias induced by having different players play, testing 2 000 rounds of a game with humans is tedious and potentially not even possible on smaller budgets. For example, the standard deviation on fB for the decks in B10 is ≈ (0.0427, 0.3191, 0.3576). If we assume we had 100 players play 10 games each, the resulting confidence interval for α = 0.05 is ≈ (0.0442, 0.4298, 0.15).


Figure 2: Union of all Pareto fronts of RO = 100 runs for all approaches considered (refer to the legend in Tab. 1) from two different perspectives. Elements of the shared Pareto front are depicted with larger squares.

Figure 3: 50%-attainment surfaces (left) and Pareto fronts (right) per approach of the union of all Pareto fronts from RO = 100 optimisation runs (depicted in Fig. 2). Refer to the legend in Tab. 1 for the colour scheme. Elements of the shared Pareto front are depicted with larger squares.

Such an interval would not allow the designers to distinguish between different solutions and is therefore not accurate enough. The standard deviation as well as the number of games needed would likely increase with the complexity of the game as well. Therefore, a definitive advantage of automatic balancing over manual playtests is the possibility of a quantitative analysis of simulation-based metrics (cf. [11]).

Except for a single deck (motor cycles), the purchased decks are all on the edge of the estimated Pareto front with the worst performances in terms of the win rate of p4. This is obvious in Fig. 3. The low win rates for p4 are probably due to the fact that the number of non-dominated cards in those decks in PD is relatively low: the fD average is ≈ −24.12 compared to the optimum of −31 (only non-dominated cards). This means that the resulting gameplay depends heavily on luck, because there are card combinations with which a player simply cannot win regardless of their skill. The only exception is the motor cycle deck with a value for fD of −30.4375 and a much better p4 win rate of approx. 0.8.

Thus, we demonstrated that the effort needed to evaluate one deck is beyond a reasonable number of playthroughs. Additionally, the manually balanced decks are located on the extreme edges of the Pareto front, implying that it is difficult to find less extreme solutions manually. This also suggests that the approximation of the Pareto front could help designers by giving them a more sophisticated idea about the characteristics of their game and potential alternatives. The findings by Nelson and Mateas also connote that designers see potential in automatic balancing tools to support the balancing process [14].

II. Automatic Balancing Quality

Next, we demonstrate the feasibility of automatic balancing, i.e. that at least some of the automatically balanced decks perform on par with the purchased decks.

II-C1 Automatically balanced decks are on the Pareto front.

II-C2 The results for II-C1 are statistically significant.

As is obvious from the plots (especially the right plot in Fig. 3), automatically balanced decks (S10 and B10) make up a large part of the Pareto front and are thus not dominated by the purchased decks. Moreover, most purchased decks are concentrated at the extreme edges of the front.

The same is true for individuals in S1050 and B1050, which ensures that, despite the stochasticity of the approach, decks on the Pareto front can be achieved in at least 50% of all optimisation runs, making the result statistically relevant.

III. Single- and Multi-objective Performance

We analyse whether the multi-objectification of the approach used by Cardona et al. [4] could result in better performing individuals. Therefore, we extend fD to fS. The bigger the dominated hypervolume (the first part of fS), the more non-dominated cards are in a deck, which is what fD expresses. The hypervolume additionally favours cards with a larger spread, which does not affect the dominance relationship (or the outcome of a playthrough or any simulation-based fitness values). The second part of fS is the standard deviation of the category means. The higher the deviation, the more problematic is the strategy of player p0 to assume uniform distributions for the categories to make up for their lack of knowledge.

III-C1 There is a significant difference between the (empirical) attainment functions of the considered multi-objective (S10, S100) and single-objective (D10, D100) optimisation approaches.

III-C2 The results of the single-objective approach (D10, D100) perform significantly worse than the multi-objective ones (S10, S100) in terms of fB.

In order to test hypothesis III-C1, the statistical testing procedure for the comparison of empirical attainment surfaces described by Fonseca et al. [7] is conducted using software published by C. Fonseca⁵. With 10 000 random mutations and α = 0.05, the null hypothesis (the attainment functions of two approaches are equally distributed) is rejected with a p-value of 0 (critical value 0.23, test statistic 1) for all comparisons in {D10, D100} × {S10, S100}. This result was expected as, judging from the visualisations (e.g. in Fig. 2), the individuals found by the analysed approaches are in very different areas. Additionally, the results found by the single-objective approach have a very low spread, which is probably owed to the character of the fitness measure fD.

The sets of solutions found for the single-objective approaches are both strictly dominated by both surrogate approaches according to the definition by Knowles et al. [12]. Formally, it holds that

(D1050 ∪ D10050) ≺ (D10 ∪ D100) ≺ S1050 ≺ S10
(D1050 ∪ D10050) ≺ (D10 ∪ D100) ‖ S10050 ≺ S100.

⁵ https://eden.dei.uc.pt/~cmfonsec/software.html


The test for hypothesis III-C1 has shown that the attainment functions of the approaches are not the same. This indicates that using fitness function fS instead of fD has improved the results in terms of fB, thus confirming III-C2. This suggests that the multi-objectification of fD can indeed improve the achieved results in this case.

IV. Computational Costs and Surrogate Objectives

We now address the feasibility of automatic balancing in terms of computational costs. In Sec. II, we have already analysed and verified its feasibility for the function fB. Therefore, the computational costs needed with RG = 2 000 simulated games per fitness evaluation and RO = 100 optimisation runs are obviously manageable for the considered application.

However, a simulation-based approach to balancing might prove too costly for more complex games with computationally expensive simulations or a large game-state space. We approach this problem by investigating the possibility of using simulation-independent measures (e.g. fS) instead of fB. Naturally, in practice these measures would need to be developed in accordance with the intended balancing goals and observations of the optimisers' behaviour, similar to what is described in Sec. II.

The following hypotheses are put forward in order to investigate the computational costs of automatic balancing and the feasibility of simulation-independent objectives:

IV-C1 Some results optimised based on fitness function fS (S10, S100) are not dominated by B10 and B100.

IV-C2 The best individuals in S10 and S100 perform at least equally well as the ones in B10 and B100 in terms of performance indicators.

IV-C3 There is no significant difference between the attainment functions of S10, S100 and B10, B100.

IV-C4 The results for IV-C1 and IV-C2 are statistically significant.

As visualised in Fig. 2 and Fig. 3, there are individuals in S10 on the shared Pareto front, which are therefore not dominated by any individual in B10 ∪ B100. In fact, B100 ≺ S10 and B10 ‖ S10. For S100, it can only be said that S100 ‖ B100, making IV-C1 only true for S10.

In order to compare the performances of the Pareto fronts of the approaches considered here, the performance indicators for HV, ε, and R2 are computed for S10p, S100p, B10p, B100p. To facilitate the interpretation of these values, the aforementioned sets are normalised (resulting in values between 1 and 2, cf. [19]) with regards to all values achieved (cf. Fig. 2) before computing the indicators. The normalised Pareto front of the union of all achieved fronts is used as a reference set (cf. Fig. 3 (right)). The resulting indicator values can be found in the upper half of Tab. 2. The non-dominated sorting ranks in Tab. 2 (top half) clearly show that hypothesis IV-C2 is true for the computed values and that the approaches with the same population size perform equally well.
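The normalisation step is a per-dimension affine rescaling; a minimal sketch under the assumption that lo and hi are the component-wise extremes over all achieved values:

```python
def normalise(points, lo, hi):
    """Rescale each objective vector to [1, 2] per dimension (cf. [19])."""
    return [[1 + (x - l) / (h - l) for x, l, h in zip(p, lo, hi)]
            for p in points]
```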

Table 2: Normalised indicator values for the Pareto fronts and 50%-attainment surfaces of the considered approaches (cf. Fig. 3) as well as their resulting ranks. The rank values in brackets are alternatives accounting for statistically insignificant distinctions.

set      HV     ε      R2     rank
B10p     0.241  0.329  0.103  1
B100p    0.541  0.559  0.203  2 (3)
S10p     0.177  0.368  0.091  1
S100p    0.475  0.501  0.206  2
B1050    0.490  0.555  0.203  1 (1)
B10050   0.638  0.650  0.258  3 (2)
S1050    0.527  0.565  0.222  2 (1)
S10050   0.676  0.692  0.288  4 (3)

In order to test the statistical significance of this statement, the width w of the confidence interval for α = 0.05 for each set and each indicator is computed. This is done using a t-test to estimate the true indicator means on the separately achieved performance indicators for the RO = 100 runs of each approach, normalised as before. When accounting for the uncertainty expressed in the confidence intervals, all differences in performance indicators in Tab. 2 (top half) are statistically significant except for the difference in R2 for B10 and S10, as well as B100 and S100. This means that in the true ranking, B100 could be ranked 3 instead of 2.

The tests used to compare empirical attainment functions for hypothesis III-C1, described in Sec. III, are applied again here to compare the attainment functions for all combinations of B10, B100 and S10, S100. Contrary to hypothesis IV-C3, the tests all reject the null hypothesis of equal attainment functions with a p-value of 0, although the decisions are a bit tighter than in Sec. III. Thus, hypothesis IV-C3 cannot be confirmed. The differences in attainment functions are likely due to the fact that the compared sets occupy different areas of the objective space (cf. Fig. 2). This can be explained by failing to express the excitement of a playthrough in fS, which was constructed to better express a deck's fairness starting from fD (cf. Sec. III). Therefore, if the goal was to approximate the solutions obtained from fB with simulation-independent fitness measures, different ones should be selected, possibly using the p-value of the aforementioned test as an indicator for their quality.

From Fig. 3 (left) it is obvious that both B1050 and S1050 contain individuals on the shared Pareto front, thus proving that IV-C1 is true in at least 50% of optimisation runs. The performance indicators for the 50%-attainment surfaces of the respective approaches are listed in Tab. 2 (bottom half), along with their ranks and the possible true rank when uncertainty is accounted for. In this case, S10050 performs significantly worse than all the other approaches considered. However, S1050 definitely performs better than B10050, and there is no clear ranking of the performances of S1050 and B1050. This implies that hypothesis IV-C4 is true as well.

Since the values in Tab. 2 are all based on normalised outcomes, the absolute values can be compared. As expected, the 50%-attainment surfaces all perform worse than their Pareto front counterparts, and the differences are significant. Interestingly, the differences per indicator are smaller in the bottom half of Tab. 2. This reflects the fact that the distances between the 50%-attainment surfaces of the different approaches in objective space are visibly smaller in Fig. 3 (left) than between the Pareto fronts in Fig. 3 (right). There are also more individuals in S10050 and B10050 when compared to S100p and B100p, respectively, which explains their smaller loss in performance indicators. This is because both approaches experience less spread in the direction of the optimum.

V. Additional Observations

In addition to the results discussed previously, some interesting observations were made during the experiments.

The single-objective optimisation approach converges to one deck for both population sizes tested, even though all decks with exclusively non-dominated cards perform equally well. The optimal fitness value for fD, −31, is achieved in almost all runs. This suggests that the algorithm used for single-objective optimisation, including the convergence detection, worked for this application. Furthermore, we can conclude that fD is not suited for deck generation because it does not distinguish decks well. This might be entirely different for data selection as done by Cardona et al. [4].

The optimiser runs were stopped by the convergence detection mechanism after very different numbers of function evaluations neval, even for the same approach. For example, the first 30 runs for S10 executed between 3 727 and 23 737 fitness function evaluations. There is no apparent correlation between neval and the quality of the achieved solutions, with neval between 3 993 and 20 243 for runs with solutions on the Pareto front of this subset of S10. This points to a high complexity of the fitness function landscapes and validates the use of online convergence detection in this experiment.


VI. Conclusion and Outlook

In this paper, we present our approach to automatic game balancing (as defined in Sec. I) and apply it to the card game top trumps. Our approach includes the formalisation and interpretation of the task as a multi-objective minimisation problem, which is solved using a state-of-the-art EMOA with online convergence detection. The performances of the resulting decks and of the purchased decks, alongside a single-objective approach [4], are evaluated using statistical analyses.

We conclusively show the feasibility of automatic game balancing in terms of the quality of the achieved solutions for the game top trumps under the assumptions detailed in Sec. IV. Being aware that computational concerns could render a simulation-based approach infeasible for complex applications, an approach to avoid simulation was outlined in Sec. IV. The presented work, therefore, is a necessary step towards proving the feasibility of automatic balancing in general. Additionally, the apparent advantages of an automated balancing approach and of multi-objective balancing are discussed (cf. Sec. V). These discussions and the additional observations in Sec. V strongly indicate that the presented approach was suitable and successful.

A possible way to proceed with this work is to further optimise the different parts of the approach. For example, the considered optimisers could be improved by better parameters, e.g. determined by tuning methods like sequential parameter optimisation, thereby potentially sharpening our results. In addition, several other modules should be tested for possible (parameter) improvements, like the online convergence detection mechanism.

With respect to the implemented player AI, it seems reasonable to extend our research by testing different improvements of the probabilistic AI used in our study. This could provide interesting results if the restriction of allowing exactly K/2 rounds of play is removed. In this case, the agent is required to plan ahead, making more complex strategies profitable. A viable AI extension is inference-based reasoning about the opponent's cards, as demonstrated by Buro et al. in their work on improving state evaluation in trick-based card games [3]. Monte Carlo search is commonly used for card games as well, as they commonly feature imperfect information (cf. [8, 20]). Another route would be the implementation of AIs that imitate human players.

Further work on the analysis of the presented measures and the discovery of new ones is intended. As a first step in this direction, we propose to use our approach for different applications, possibly after developing application-specific methods. In that regard, we aim to test our approach on more complex computer games. A first attempt will be made incorporating The Open Racing Car Simulator (TORCS, http://torcs.sourceforge.net/), but further tests on real-time strategy games and platformers are intended as well. Based on the analysis of different well-performing fitness measures, a next step could be the investigation of generalisable ones.

More importantly, we plan to evaluate our vision of a balanced deck, our fitness measures, and the results of our methods with surveys for human players. In our opinion, incorporating human perception of balancing is the only acceptable way to achieve the eventual goal, i.e. accurately expressing and maximising human players' enjoyment of a game.

References

[1] N. Beume, B. Naujoks, and M. Emmerich. SMS-EMOA: Multiobjective Selection Based on Dominated Hypervolume. European Journal of Operational Research, 181(3):1653–1669, 2007.

[2] C. B. Browne. Automatic generation and evaluation of recombination games. PhD thesis, Queensland University of Technology, 2008.

[3] M. Buro, J. R. Long, T. Furtak, and N. R. Sturtevant. Improving state evaluation, inference, and search in trick-based card games. In International Joint Conference on Artificial Intelligence (IJCAI), pages 1407–1413. Morgan Kaufmann, San Francisco, CA, 2009.

[4] A. B. Cardona, A. W. Hansen, J. Togelius, and M. G. Friberger. Open Trumps, a Data Game. In Foundations of Digital Games (FDG 2014). Society for the Advancement of the Science of Digital Games, Santa Cruz, CA, 2014.

[5] H. Chen, Y. Mori, and I. Matsuba. Solving the balance problem of massively multiplayer online role-playing games using coevolutionary programming. Applied Soft Computing, 18:1–11, 2014.

[6] K. Deb. Multi-Objective Optimization Using Evolutionary Algorithms. Wiley, Chichester, UK, 2001.

[7] C. Fonseca, V. D. Fonseca, and L. Paquete. Exploring the performance of stochastic multiobjective optimisers with the second-order attainment function. In C. A. C. Coello et al., editors, Evolutionary Multi-Criterion Optimization (EMO 2005), pages 250–264. Springer, Berlin, 2005. doi: 10.1007/b106458.

[8] T. Furtak and M. Buro. Recursive Monte Carlo search for imperfect information games. In Computational Intelligence and Games (CIG 2013), pages 225–232. IEEE Press, Piscataway, NJ, 2013.

[9] G. Hawkins, K. V. Nesbitt, and S. Brown. Dynamic Difficulty Balancing for Cautious Players and Risk Takers. International Journal of Computer Games Technology, 2012:1–10, 2012.

[10] A. Isaksen, D. Gopstein, J. Togelius, and A. Nealen. Discovering Unique Game Variants. In H. Toivonen et al., editors, Computational Creativity (ICCC 2015). Brigham Young University, Provo, Utah, 2015.

[11] A. Jaffe. Understanding Game Balance with Quantitative Methods. PhD thesis, University of Washington, 2013.

[12] J. Knowles, L. Thiele, and E. Zitzler. A Tutorial on the Performance Assessment of Stochastic Multiobjective Optimizers. TIK Report 214, Computer Engineering and Networks Laboratory (TIK), ETH Zurich, 2006.

[13] A. Liapis, G. N. Yannakakis, and J. Togelius. Sentient Sketchbook: Computer-aided game level authoring. In Foundations of Digital Games (FDG 2013), pages 213–220. Society for the Advancement of the Science of Digital Games, Santa Cruz, CA, 2013.

[14] M. J. Nelson and M. Mateas. A requirements analysis for videogame design support tools. In Foundations of Digital Games (FDG 2009), pages 137–144. ACM Press, New York, 2009.

[15] M. Preuss, J. Togelius, and A. Liapis. Searching for Good and Diverse Game Levels. In Computational Intelligence and Games (CIG 2014), pages 381–388. IEEE Press, Piscataway, NJ, 2014.

[16] A. M. Smith and M. Mateas. Variations Forever: Flexibly generating rulesets from a sculptable design space of mini-games. In Computational Intelligence and Games (CIG 2010), pages 273–280. IEEE Press, Piscataway, NJ, 2010. doi: 10.1109/ITW.2010.5593343.

[17] A. M. Smith, E. Butler, and Z. Popovic. Quantifying over Play: Constraining Undesirable Solutions in Puzzle Design. In Foundations of Digital Games (FDG 2013), pages 221–228. Society for the Advancement of the Science of Digital Games, Santa Cruz, CA, 2013.

[18] J. Togelius, M. Preuss, N. Beume, S. Wessing, J. Hagelbäck, G. N. Yannakakis, and C. Grappiolo. Controllable procedural map generation via multiobjective evolution. Genetic Programming and Evolvable Machines, 14(2):245–277, 2013.

[19] H. Trautmann, T. Wagner, B. Naujoks, M. Preuss, and J. Mehnen. Statistical Methods for Convergence Detection of Multi-Objective Evolutionary Algorithms. Evolutionary Computation, 17(4):493–509, 2009. doi: 10.1162/evco.2009.17.4.17403.

[20] C. D. Ward and P. I. Cowling. Monte Carlo search applied to card selection in Magic: The Gathering. In Computational Intelligence and Games (CIG 2009), pages 9–16. IEEE Press, Piscataway, NJ, 2009.
