Effective Testing of Free-to-Play Games

Uploaded by emilygreer, 16-Apr-2017


TRANSCRIPT

A few years ago, when social games were exploding on Facebook, the conventional wisdom was that you wanted to release your minimum viable product as quickly as you could and iterate on it in the wild with real data from players. But that only made sense in a world where, if you “wasted” your early traffic on a poor game, it was relatively easy to get more. Mobile is the opposite: traffic is always at a premium, and a strong global launch has become crucial to success. It’s the best chance to get substantial featuring from the platforms, to get noticed in the new-release charts, and potentially to get picked up by recommendation algorithms. Ideally we’d all be like Blizzard & Supercell and be able to polish and test games internally for years, but there are all sorts of pressures and costs pushing games out the door. For companies with <$1B in annual profit it’s important to be realistic about what, why, and how to test to maximize our chance at success.

We started Kongregate back in 2006 as an open platform for browser games, a little like YouTube for games: anyone can upload, and we then add chat, comments, forums, achievements, and a lot of other social features that make the site itself a kind of game. In 10 years more than 100,000 games have been uploaded to the site, covering pretty much every genre imaginable. A fairly wide array of games are popular, from casual puzzles and launch games to MMOs and collectible card games. Overall our audience trends male, with heavy overlap with console and Steam gamers.

Four years ago we started publishing third-party games on mobile, and we have launched more than 20 games in the last 3 years. As on Kongregate itself we publish a fairly broad range of games, from more niche, high-monetizing RPGs and CCGs to single-player games with mostly ad monetization. To give you a feel for the range, here are the games we published in 2015.

And like everything in life, these testing methods have different strengths and weaknesses. To use them properly it’s important to understand what those are, so I’m going to spend a bit of time going through the different types. One note: I’m not going to talk much about defect- and bug-oriented QA testing, which is not to say it’s not important. It’s very important, and you should do it, ideally with dedicated in-house resources supplemented by 3rd-party resources. But I only have an hour, so I have to skimp on some topics.

By team playtests I mean both the testing you do naturally as you add features to the game and scheduled team sessions to look at the game more broadly. The strengths are obvious: the game is always available, and everyone’s getting paid to do this. The weakness is that you don’t play like a player: you know how things are supposed to work, and you know games too well, so every convention is obvious to you. And since you’re testing to make sure features are working, you play the game in totally unnatural ways: “I’m jumping right to X,” “I’m pushing all the buttons,” “I’m going to play with an OP account to breeze through.”

By in-person playtests I mean getting a 3rd party unrelated to the game’s development to play the game. It could be as informal as handing somebody a device out in the wild, or as formal as an organized, in-office session where you hire people to come in and play the game.

•  They’re a pain in the ass to arrange, whether you’re bringing strangers from Craigslist into your office or accosting them at a coffee shop. And they take a lot of skill to run & analyze well: you have to avoid prompting the person, avoid jumping to solutions, avoid conflating problems, and hear beyond what they actually say.

•  They are psychologically difficult – exposing your work is hard. It’s pretty much equivalent to the feeling most of us have about getting up to sing in a public karaoke bar, only without it being appropriate to get a little drunk first. Combined with being a pain to arrange, that means they get pushed off indefinitely: “it’s not ready,” “they’ll only tell us what we already know.”

•  Depending on how you’re recruiting, your testers may look nothing like your actual audience.

Remote playtest services – companies like Usertesting.com (whom we use) and others – record testers playing your game while they narrate their experience.

•  Generally there’s a directed task testers are supposed to complete, so it’s a less natural experience than just picking up a game and exploring, with a higher willingness to “figure it out.”

•  Limited view of body language; narration expresses conscious thoughts, not unconscious ones; and there’s no chance to follow up.

•  Still limited to the first-session experience, now with a time limit.

•  Still a small sample, with luck of the draw on testers and some selection bias in terms of who takes these kinds of gigs.

By this I don’t mean in person, but sending out a mobile build or a link to a web version to a group of people you know and seeing what happens.

•  More realistic experience: they’re testing the whole game across as many sessions as they want to play.

•  Depending on who it goes to, you can get good qualitative feedback.

•  Audience tends to be biased in your favor, professional game developers, or both, and a lot of people won’t want to hurt your feelings. You’re unlikely to hear “this sucks” even if it’s true.

•  Low sample size & unrepresentative/biased audience = mostly garbage metrics

The way to think about this: if we assume that tutorial completion is around 80% globally, then if I get 400 installs representative of the global audience, 95% of the time the measured tutorial completion rate will be between roughly 76% and 84%.

Now just because you have a smaller sample than this doesn’t mean metrics are useless. In a normal distribution you are more likely to be closer to the mean than farther away. My unscientific rule of thumb is that you start getting directionally useful, if not accurate, metrics once an event has occurred 75 or more times. So for directionally useful tutorial results you need about 100 people installing a game (at ~80% completion that’s ~80 events), and for a directionally useful buyer conversion rate more like 5,000 installs (at a ~1.5% conversion rate that’s ~75 buyers).
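To make those intervals concrete, here’s a minimal sketch using the standard normal-approximation confidence interval for a proportion; the 80% completion rate is the assumption from above.

```python
import math

def proportion_ci(p, n, z=1.96):
    """95% normal-approximation confidence interval for a proportion."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# An assumed 80% tutorial completion rate, measured on cohorts of various sizes.
for n in (100, 400, 5000):
    lo, hi = proportion_ci(0.80, n)
    print(f"n={n:5d}: 95% of samples fall between {lo:.1%} and {hi:.1%}")
# n=  100: ~72% to ~88% (directional at best)
# n=  400: ~76% to ~84%
# n= 5000: ~79% to ~81%
```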

By this I mean inviting a broader group of players to play a game not yet released, either on web or Android, usually volunteers from a fanbase.

•  All the benefits of friends & family tests, with larger sample sizes!

•  REALLY engaged audience excited to give feedback, not constrained by politeness. They’ll tell you your game sucks.

•  Chance to build a community for your game pre-release.

On Kongregate.com beta access to games is a benefit of Kong+, our ad-free subscription that players pay $29.99/year for. We also gift it to our volunteer moderators and big spenders. Kong+ has about 30,000 MAU, and the average beta game gets ~3k beta users. We consistently have 5-10 games in closed betas, which we use for our publishing portfolio, but the program is also open to other developers.

Metrics are the product of the audience and the underlying game. You can get average metrics either by putting poorly qualified traffic into an amazing game or by putting amazing traffic into a mediocre game. Now this is somewhat obvious, but if you’ve been working on a game for a year it’s easy to forget about audience and think metrics are entirely about the game. It is especially easy to underestimate how BIG the audience swing can be. This was the most extreme split we’ve ever seen, with a 9x difference in % buyers, a figure that is more commonly 3-4x higher in beta than in global release. But even though it’s inflated it’s still useful: we could tell this was going to be a high-ARPU game with mediocre initial retention but good long-term retention. (Note: web D1 tends to be much lower than mobile, but they are comparable by D30.)

This is the now-classic method for mobile: releasing fully, but in just a few countries, often Canada and/or Australia. The real thing is hard, and a lot of the weaknesses are just aspects of the mobile game market:

•  It’s hard work releasing anything on mobile – builds, screenshots, and particularly getting games working on such a broad range of devices.

•  Long Apple approval times make iteration slow, even when you can quickly fix a problem.

•  Traffic doesn’t magically show up in games, and buying it is expensive – the average is ~$3 per install in Canada & Australia, a bit less on Android.


Australia & Canada may be good proxies for the US, but the US is likely to be less than 1/3 of your installs. They used to be closer proxies, but the gap widened after the release of the iPhone 6, when a lot of high-end device users switched back to iOS. Also watch for performance fading after your first “golden cohort,” whether it’s CPIs going up or retention & LTV going down; this is especially pronounced for more niche, high-LTV genres like CCGs, RPGs & strategy games. In a small market a few big buyers can blow out the numbers. And test market spend tends to be less ROI-focused, so you see weird patterns.

Spellstone, a polished collectible card game from Synapse Games, shows some of these dynamics. You can see that the performance of iOS is generally much stronger than Android, and that paid traffic is generally quite a bit stronger than organics. But the really dramatic number is the huge drop in performance of the organic traffic that came from the substantial featuring the game got, with around a 70% drop in ARPU on both iOS and Android. This is most dramatic with more niche, high-LTV genres like CCGs, RPGs & strategy games. For a more casual, broad-audience game like AdVenture Capitalist we don’t typically see a big delta.

Your game is driven by outliers, and their presence or absence distorts almost anything you look at. Binary yes/no metrics like D1 retention or tutorial completion are more reliable than averages involving engagement or revenue. And the deeper your game and the less spending is capped, the more unpredictable those averages get. So as much as possible, look at binary metrics that are proxies for the averages, or that answer the same questions: % repeat buyers, for example, rather than ARPPU.
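Here’s a small simulation sketch of that instability; the lognormal spend distribution and 35% repeat-purchase rate are illustrative assumptions, not real portfolio numbers.

```python
import numpy as np

rng = np.random.default_rng(42)

def cohort_metrics(n_buyers=50):
    """One hypothetical cohort of buyers: heavy-tailed spend per buyer,
    plus a simple yes/no 'bought again' flag."""
    spend = rng.lognormal(mean=2.0, sigma=1.5, size=n_buyers)  # dollars
    repeat = rng.random(n_buyers) < 0.35                       # assumed 35% true rate
    return spend.mean(), repeat.mean()

# Measure how much each metric jumps around from cohort to cohort.
arppu, repeat_pct = zip(*(cohort_metrics() for _ in range(2000)))
for name, vals in (("ARPPU", np.array(arppu)), ("% repeat buyers", np.array(repeat_pct))):
    print(f"{name:16s} mean={vals.mean():5.2f}  relative spread={vals.std() / vals.mean():.0%}")
# The heavy-tailed average swings roughly twice as much as the binary proportion.
```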

Don’t get me wrong, I love A/B tests. They let you separate correlation from causation, and they handle the audience-mix problems well, too. But don’t expect to be able to run a lot of A/B tests in test markets unless you’re willing to spend major $$s. The problem is sample size: take the numbers I was giving earlier for cohort sizes, then double them for an A/B test. And with an A/B test you need to measure the results fairly precisely – directional numbers are not good enough, or you can make bad choices based on the results.
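To put rough numbers on it, here’s the standard two-proportion sample-size formula (normal approximation, 5% two-sided significance, 80% power); the conversion and retention rates are hypothetical.

```python
import math

def n_per_arm(p1, p2, z_alpha=1.96, z_power=0.84):
    """Players needed in EACH variant of an A/B test to reliably detect
    a shift from rate p1 to rate p2 (two-sided z-test, alpha=0.05, power=0.80)."""
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)

print(n_per_arm(0.020, 0.025))  # buyer conversion 2.0% -> 2.5%: ~14,000 per arm
print(n_per_arm(0.40, 0.45))    # D1 retention 40% -> 45%: ~1,500 per arm
```

At the ~$3 CPIs mentioned earlier, the conversion test above implies something like $85k of paid installs before the result is readable, which is why revenue-facing A/B tests usually wait for bigger markets.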

This is more of a strategy than a testing method, but one I think is underused. I’m biased, of course: I have a web portal. But this strategy has worked for a lot of big successes, from King with Candy Crush to Blizzard recently with Hearthstone. There are lots of web platforms to choose from – Steam, Kongregate, Facebook, Miniclip, Newgrounds, Addicting Games, and hundreds more – but it’s important to find the right audience fit: Facebook is a much better choice for a very casual game than Steam or Kongregate. The web offers better social feature support (forums, videos, streaming, etc.) to build community, and comparable LTVs (at least on Kongregate). On the technology side, Chrome no longer supports the Unity plug-in, and Firefox will likely kill it by the end of the year. Flash is still going for now, but will likely be phased out in a few years. But the WebGL export from Unity is improving rapidly, and there are a lot of other good cross-platform frameworks to work with, such as Haxe.

Over the next 6 months they worked on polishing the UI and adding monetization, all while releasing the game on dozens of additional platforms across the web, and they were able to extend the content and improve the balance while building a bigger and bigger audience.

Mobile test markets lasted about 2 months, focused on mobile device stability, the FTUE, and the new rewarded-video integration, a huge addition to monetization.

And since each type of test adds something, the ideal plan is to use all of them, in approximately this order, with team playtests throughout as a given.

How much money and time you have are intimately related: time is generally the biggest cost, because each month of a studio’s burn rate adds up. But there are situations where that’s not true. In a big company you may have intense pressure to launch by a certain date but a lavish budget for test market marketing, while an indie doing work-for-hire or with a full-time job may have little time pressure but no money to buy traffic. The indie should focus on in-person playtests and closed betas, while with enough money the big studio can blast its way through geo-locked test markets.

Genre matters too. A simple puzzle game or endless runner that’s easy to pick up and play is going to get more value out of the experiential testing that helps nail the fun in the core experience, but isolated, single-session tests are much less useful for multiplayer games with deep metagames and economies, where long-term metric-based testing is crucial to getting things right. Games with lots of graphics, or that are otherwise technically demanding, are going to need extra time in mobile geo-locked test markets to deal with all the problems that crop up on low-memory and low-GPU devices on both Android & iOS.

And finally: what are your goals? Do you expect to get significant featuring from the platforms? Are you looking for top 10 grossing? Top 1000? The bigger the launch you’re expecting, the higher the stakes, and the more crucial it is to have the game in the best state possible at launch.

This plan assumes both time and money are at least somewhat constrained – say, a mid-sized studio with most of its burn covered by existing game income, but not a big cushion. It still assumes 6 months from friends & family testing to global mobile release. If you cut it much shorter on a multiplayer game you will almost certainly regret it.

Now consider a single-player game made by a small cash-strapped team, with mobile-specific controls and ad-based monetization. In this case most of the real game testing and iteration should happen through rigorous, frequent in-person testing. The mobile test market can then concentrate on just a few metrics.

During pre-production & production the key question you’re asking is “Is this fun? Are we on the right track?” The sooner you figure out something isn’t working as you expected, the easier it is to fix. The more you are departing from convention and comparables the more you need to validate as you go along.

As you approach release you’re asking “Is this ready?” It’s a great time to do remote playtests focused on the first-time user experience, and to make sure that analytics are hooked up and firing correctly – that last is not a given; analytics are very easy to screw up. This is also a great time to send your game to a 3rd-party QA service to test on a broad range of devices if you’re going straight to mobile.

“It’s the next Clash of Clans! Or Crossy Road!” “Everything is broken! Who would even play this piece of shit? Total failure.” Depending on the person, they may cherry-pick the good or focus only on the bad.

This is where the rubber really meets the road. To know if the game is working, you have to know what you’re looking for.

We have a very successful game with 20% D1 retention. We have very successful games with $0.03 ARPDAUs.

So here are some sample metric ranges, from low to high, first for the genres I’ve been using for the example test plans, then for all genres. This is loosely based on the metrics we’ve seen from games in our portfolio and more generally on Kongregate.com. As you can see, the all-genres low/medium/high can be pretty different from the ranges by genre. Good retention for a multiplayer RPG is drastically lower than for an idle game. Good monetization for a casual runner would be terrible for that same RPG. What’s missing from this is expectations around traffic: that casual runner probably has 10x the potential traffic of the multiplayer RPG.

Here are a couple of outcome models based on the average metrics for each of these genres, broken out by different levels of traffic. You’ll notice these profitability scenarios aren’t great: games with average metrics need exceptional traffic to become a sustainable business. In general, for a success you need at least one or two metrics to fall into one of the high scenarios. Set goals that match realistic expectations for the genre and acceptable (not ideal) business outcomes. If you need to hit top 20 grossing to justify huge budget/company expectations, your goals should be much higher than if you mostly want to learn.

It’s not that you don’t look at other metrics, but you want to set the gates based on the most important one for that stage.

One of the benefits of breaking your testing into stages is that on mobile it allows you to use a wider variety of test markets. Canada & Australia are not only expensive places to test stability, they’re a bad choice because they have a much lower % of the low-end devices most likely to trigger issues. That’s better tested in emerging markets. Overall testing in a range of countries from tier 1 to tier 3 will give you a much more representative view of your global performance than limiting to a few English-speaking markets. We’ve used more than 20 different countries in the last year, all shown in this map.

Note that the sample sizes are cumulative. 12k isn’t enough for statistical significance on buyer %, but 25k is.
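As a sanity check, the same proportion math from earlier bears those thresholds out; the ~2% buyer conversion rate here is an assumed figure.

```python
import math

def margin(p, n, z=1.96):
    """Half-width of the 95% confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (12_000, 25_000):
    m = margin(0.02, n)  # assumed ~2% buyer conversion
    print(f"n={n}: buyer % = 2.00% +/- {m:.2%} ({m / 0.02:.0%} relative error)")
# n=12,000: +/- 0.25pp (~13% relative error)
# n=25,000: +/- 0.17pp (~9% relative error)
```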

This is not optional, but a surprising number of developers don’t do it. Crashes are annoying to players, affecting both retention and your ratings and reviews in the app stores. We recommend a 3rd-party service like Fabric/Crashlytics or HockeyApp, though there’s quite a bit of info in the Google Play console as well.

Stage 2 is where you’re really optimizing your game, and you should look at metrics in the same order that players move through the game, both because that’s the order in which sample sizes will get large enough and because problems in one area will likely flow into the next. You start with progression through the first-time user experience, checking drop-off pre-tutorial and then at each tutorial step. Then you watch PVE progression: what’s the progression through missions, what are the win rates? After players have been in the game a little longer you want to look at PVP participation and win rates, and then finally at the economy, where you should look at the full sinks-and-sources flow but keep a particularly close eye on resource balances and how they grow.
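A sketch of that first funnel check, assuming a simple event log with one row per player per step reached; the table layout and step names are hypothetical.

```python
import pandas as pd

# Hypothetical event log: one row per player per funnel step reached.
events = pd.DataFrame({
    "player_id": [1, 1, 1, 2, 2, 3, 3, 3, 4],
    "step": ["install", "tutorial_1", "tutorial_2",
             "install", "tutorial_1",
             "install", "tutorial_1", "tutorial_2",
             "install"],
})

steps = ["install", "tutorial_1", "tutorial_2"]  # funnel order
reached = events.groupby("step")["player_id"].nunique().reindex(steps)
print(pd.DataFrame({
    "players": reached,
    "% of installs": (reached / reached.iloc[0]).round(2),
    "step conversion": reached.pct_change().add(1).round(2),
}))
```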

Retention is the KPI reflection of progression. Without longer-term retention, which reflects commitment and engagement with the game, few people will pay.

Conversion reflects both engagement and balance: do they care enough to buy something, and is there a good reason to? If retention is good but conversion is bad, then either what’s being sold is not compelling, the balance is not challenging enough for it to be useful, or the economy is imbalanced in a way that leaves no reason to purchase, because you can get everything for free. Note that there is some tension between retention and conversion, though, because tight economies may make players more likely to churn. New-buyer packs can be great at boosting conversion while masking underlying problems.

One of the most important stats to look at is repeat conversion: how many people buy a 2nd time? A 3rd? Repeat conversion shows both how players feel about their first purchase and whether there is depth of spend. If you have a high conversion rate but low repeat purchasing, your game will just pop and drop.

Note that I haven’t mentioned ARPPU. That’s because while it’s an important metric, it’s not one you can really look at with any reliability: revenue follows an exponential-style distribution and is very erratic in small samples. However, ARPPUs are capped fairly low unless there are repeat purchases, so looking at repeat conversion, which is a simple proportion with much more normal behavior, answers the same question with better stability.
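Repeat conversion is cheap to compute from a raw purchase log; a minimal sketch with made-up transactions (column names are hypothetical):

```python
import pandas as pd

# Hypothetical purchase log: one row per transaction.
purchases = pd.DataFrame({
    "player_id": [101, 101, 102, 103, 103, 103, 104],
    "amount":    [4.99, 9.99, 4.99, 1.99, 4.99, 19.99, 4.99],
})

buys_per_buyer = purchases.groupby("player_id").size()
print(f"{(buys_per_buyer >= 2).mean():.0%} of buyers made a 2nd purchase")
print(f"{(buys_per_buyer >= 3).mean():.0%} of buyers made a 3rd purchase")
```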

It’s human nature to project causation, so as you make changes to your game you’re likely to look at daily numbers and think they’re the result of what you’ve done. Resist. Even in a reasonably big test market you get tons and tons of random variation in the daily numbers. These numbers are from Spellstone’s test markets, which had between 1,000 and 1,500 DAU through most of this period, and the daily numbers are all over the place. We track rolling averages, which helps, but if you want to look at the impact of changes, roll up cohorts from before and after the change to get a statistically significant sample. And still take that with a grain of salt because of audience mix and confounding effects from other changes you’re making.
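A sketch of both habits, assuming a daily KPI table; the numbers are synthetic noise, not Spellstone’s data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
daily = pd.DataFrame({
    "date": pd.date_range("2016-06-01", periods=60),
    # synthetic noisy daily D1 retention around a true rate of ~35%
    "d1_retention": rng.normal(0.35, 0.05, size=60).clip(0, 1),
})

# 1) Smooth day-to-day noise with a 7-day rolling average.
daily["d1_rolling_7d"] = daily["d1_retention"].rolling(7).mean()

# 2) To judge a change shipped on a given day, pool the cohorts before
#    and after it instead of eyeballing single days.
ship_date = pd.Timestamp("2016-07-01")
before = daily.loc[daily["date"] < ship_date, "d1_retention"].mean()
after = daily.loc[daily["date"] >= ship_date, "d1_retention"].mean()
print(f"pooled D1 before: {before:.3f}   after: {after:.3f}")
```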

An important part of mobile test markets is optimizing the assets you’re using in the app store – we regularly see substantial gains from testing icons, video, screenshots, and copy, though we’ve never seen a significant difference from testing game names. But context is important: results are often inconsistent between app stores and geos. For that reason we don’t start our ASO until we’ve expanded beyond T3 markets. Google’s tools for this are great, but for iOS you need to use a paid service like StoreMaven.

Test market marketing isn’t just about driving installs, it’s about testing the marketing itself, optimizing creatives & targeting, generally figuring out: will this work? Will I be able to drive audience into my game? Can I do it profitably? Test a lot of creatives across a lot of networks. Keep refreshing. You never know what’s going to work because again context matters a lot.

There’s nothing worse than getting big featuring and then having the servers crumble under the load. It feels like flushing success down the toilet, because you are. Now, “how to load test a game” is a subject that deserves its own GDC talk, and I’m not the person to give it. But if you have a server-based game and any hope of success you need to do this.

Now I want to go through a case study of what this looks like with a game that was neither a triumph nor a failure. Raid Brigade is a party-based action RPG with base-building elements and an unusual one-handed control scheme. It’s the first game by San Diego studio Ultrabit, made up mostly of Zynga veterans. After about 12 months of development we released it to mobile test markets last June, skipping our normal PC stage because of questions about how the controls might work. We were all excited for the game, but the initial results were way below our goals and expectations: the first few weeks saw dreadful performance, with only 40% of people getting through the tutorial and D7 retention 75% below the goal numbers.

The good news is that there were lots of obvious things to fix. There were long loading times for assets being streamed in, streaming being essential to keep the initial download size of a game with 3D art under 100MB. Improving those and the tutorial got tutorial completion up to 70% and doubled D7 retention, though it was still well below our goal of 18%.

To help people get into the various branching systems we then switched from a linear tutorial to one based on a series of quests, which again helped increase D7 retention significantly, this time up to about 12%. Again it was great to see that much improvement, now 3x what we started at… but still below goals. And gains in retention after that became harder to get, though we kept working on it, along with many other things. After 3 months in test markets we had to face a dilemma: the game was meeting some of our goals (crash rate, tutorial, D1 retention, conversion) but not all of them (D7 and D30 retention, repeat buyer %), and the developer’s runway was starting to be a concern. We could go ahead and launch in October as planned, or keep working on it, cut into the developer’s runway, and go up against the glut of games coming out in the holiday season.

When your metrics are good, the decision path on when to launch is easy. When you have plenty of time and money you can usually keep working on the game, though even there it’s important to be realistic about whether the game is fixable. If you’re pretty sure you understand what’s wrong and have a good idea how to fix it, that’s great, but there may be diminishing returns or unfixable flaws.

When you get to the point where you either can’t keep working on the game or it doesn’t seem worthwhile, the question becomes: do you launch? At that point it’s time to treat the money you’ve spent on development as a sunk cost and think just about the effort needed to launch and support the game after launch. For a single-player game without servers this is fairly simple, and in most cases you should go ahead, launch, and see what happens. But with multiplayer games it becomes more complicated: there are ongoing server costs and the necessity of releasing additional content to drive revenue and engagement, as well as critical-player-mass and opportunity-cost issues.

Supercell and some other companies will only support a game if it’s a huge success, but exactly what that level should be will have different answers by studio. Personally I don’t think you should launch a game unless you can support it for a fairly extended time frame, a year plus at least. Players are investing in your game, and that should be respected. In 4 years of publishing we’ve never cancelled a single-player game, though we did stop one from bothering with an Android version. But this year alone we canceled three multiplayer games, one after a Kong+ beta and two during mobile test markets.

Making a game these days can feel like walking into a dark forest. But remember: you have many tools at your disposal. Be prepared, like the Boy Scouts say; have a plan; be realistic; and hopefully you can make it through the forest and find the treasure you were looking for.