
Tom Barker

Full Stack Web Performance



978-1-491-98844-2

[LSI]

Full Stack Web Performance
by Tom Barker

Copyright © 2017 O’Reilly Media, Inc. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 [email protected].

Editor: Meg Foley
Production Editor: Shiny Kalapurakkel
Copyeditor: Octal Publishing, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

July 2017: First Edition

Revision History for the First Edition

2017-06-16: First Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Full Stack Web Performance, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.


Table of Contents

Introduction

1. Client-Side
   Use a Speed Test
   Now Integrate into Continuous Integration
   Use Log Introspection for Real User Monitoring
   Tell People!
   Summary

2. Accomplishing Web Performance Wins via Infrastructure
   Using a CDN
   Edge Caching: Serving Your Application as Close as Possible to Your User
   Make Requests to the Fastest Possible Origin
   Using a Cloud Provider
   Summary

3. Operationalize Performance
   Setting Up an APM
   Using an APM to Troubleshoot Performance Issues
   Summary

4. Next Steps
   Get Synthetic Web Performance Results
   Trial a CDN for Free
   Trial an Application Performance Management Tool for Free
   Embrace Full-Stack Development and DevOps

Introduction

We are in the midst of another tidal change in the software engineering and IT industries. This has been going on for a number of years already, but like the frog in the pot that doesn’t notice the water slowly beginning to boil around him, some of us might not have noticed the transitions in our environment. We’ve overlooked these transitions because there have been many smaller ones that we just adjusted to, and that accumulated into one significant change. Or maybe it’s just that the ideas behind these changes have been talked about for quite some time, but it’s only relatively recently that they have coalesced into actionable patterns that are easy to implement and reproduce.

I was reminded of this recently because some of my teams (specifically those that are working on products that we made three or more years ago) are migrating their products from physical datacenters to cloud platforms.

Think about that for a minute: three or four years ago we were starting new projects first by requesting nodes, in some cases virtual machines (VMs) on a hypervisor and in other cases actual physical boxes, along with IP blocks, and then waiting days, or in most cases even weeks, for the boxes to be configured. Today, of course, we run a script and our cloud platform of choice spins up nodes preconfigured with the image that we want nearly instantaneously.

Our notions of web performance and capacity planning, too, have changed. Now if we need to scale a cloud web-native application to handle spikes in usage, it’s a matter of only selecting a checkbox and paying for the scale that we need. There are also new challenges in the field that we are just now discovering, and coming up with workarounds for, like the appropriate use of availability zones (think about the Amazon Web Services outage of 2017 that brought down a significant portion of the web) or even how to use multiple cloud service providers to serve a single property.

Even our organizational identities are changing. If someone is a web developer, why can’t they learn to request and configure VMs from their cloud provider? And if they are setting up the machines on which their code resides, and maybe even the firewall rules, why not then set up their own log consumption flow? At that point, are they still just a web developer, or are they maybe a full-stack developer or even a DevOps engineer?

These are just a few examples of how a concept like DevOps has changed the day-to-day activities of our work routines.

Who This Book Is For

DevOps means different things to different groups. Some interpret it as an integration of traditional infrastructure or Ops roles into a development team, whereas others bundle security in there and call it DevSecOps. Some groups also have different interpretations of what Ops means, defining it as incorporating not just infrastructure but also production support roles.

As such, this book is for anyone who needs to think about and deal with performance in a DevOps environment. From web developer, to DevOps engineer, to engineering manager and architect, this book is intended for you.

This book sets out to address how web performance fits into this modern landscape of the all-encompassing cross-functional DevOps team. You’ll find the topics organized into three high-level areas of focus in a product development group:

Client-side
    This is the user-facing piece of the application. It will generally run on the user’s hardware.

Infrastructure
    This consists of the facilitating pieces of your application, specifically the content delivery network (CDN) and cloud service.

Operations
    These are the practices you put in place to monitor and alert on the health of your applications.

This book also presents significant but quick wins throughout. "Quick" is a relative term, but the way that I have approached this is to take advantage of existing tools and libraries that are fast to integrate with but have huge payoffs. Depending on your architecture and team makeup, the level of effort for each of these solutions or recommendations should be measured in days and weeks, not months, for the most part (at least by my reckoning; your mileage may vary).


CHAPTER 1

Client-Side

In web development, the client side is the code that runs on the users’ hardware: in browsers on their laptops, in web views in a mobile app on their phones, or perhaps even in a render engine running on the set-top box in their living rooms.

There is a huge body of written work around improving web performance, beginning with Steve Souders’ seminal book, High Performance Web Sites, but the landscape of client-side performance changes rapidly, in large part because of proactive performance improvements implemented by the browser makers themselves, as well as the work being done at the standards level.

So, as busy product development engineers and engineering leaders, how do we keep up with the changes and reconcile the differences across browsers? One way is to rely on synthetic performance testing using one or more of the tools available today.

Use a Speed Test

Speed tests, or performance testing tools, are applications that load a site and run a battery of tests against it, using a dictionary of performance best practices as the criteria for these tests. These tools are constantly updated and should reflect and test against the current best practices. They keep track of changing practices, so you don’t need to.

1 Available at https://www.webpagetest.org. This is run by Patrick Meenan and is used to power such sites as the HTTP Archive.

Two terms you’ll see around performance testing are synthetic testing and real user metrics. Real user metrics are data gathered from actual users of your site, which you harvest, analyze, and learn from to see what your audience’s actual experience is like. Synthetic testing is when we run tests in a lab to identify performance pitfalls before we release to production. Speed tests are what we would call synthetic tests.

There are many performance testing tools on the market, from browser-integrated tools like YSlow and free web applications like WebPageTest to full enterprise solutions. My personal favorite web performance testing tool is WebPageTest.1 You can use the hosted site as is, or, if you want, you can download the project and run it on-premises using so-called private instances, so that you can use it to test your preproduction environments that are usually not publicly accessible.

WebPageTest essentially employs a number of agents around the world that can run a huge variety of devices to go to your site and give you the performance metrics from each run. Figure 1-1 shows the WebPageTest homepage.

Note the Test Location drop-down menu. That’s where you choose your agent. You can use this to choose not only where the test is run, but also the hardware and operating system on which it is run. You also can choose a number of other options, including the type of connection to use, the different browsers that are available to that particular agent, and whether you want to test just the initial uncached view of the site or also additional cached versions for comparison.


2 You can find more information about Speed Index at https://sites.google.com/a/webpagetest.org/docs/using-webpagetest/metrics/speed-index.

Figure 1-1. The WebPageTest homepage

Each test you run returns results that include the following, among other things:

• A letter rating of A–F, which describes the site’s performance
• A high-level readout of your most important metrics such as time to first byte, Speed Index,2 number of HTTP requests, and total size of payload for site (see Figure 1-2)
• Waterfall charts that show the order and timing of each asset downloaded
• A chart that shows the ratio of what asset types make up your page’s payload


Figure 1-2. The results of my WebPageTest run (the F’s in my score are because of content loaded onto my page from an advertising partner)

Also included in your results are details of your rating that outline the items that were flagged against the criteria of the test. These details are essentially a checklist to follow to improve your site’s web performance. See Figure 1-3 for an example of these. Do you come across some images that aren’t compressed? Compress them. Do you notice some files that aren’t served up gzipped? Set up HTTP compression for those responses.

Figure 1-3. Flagged items

Let the tool do the analysis so that you can manage the implementation of its findings.


Now Integrate into Continuous Integration

Performance testing tools are great, but how do you scale them for your organization? It’s not practical to remember to run these tests ad hoc. What you need is a way to integrate them into your existing continuous integration (CI) environment.

Luckily, WebPageTest provides an API. All you need is an API key, which you can request at http://www.webpagetest.org/getkey.php.

Using the API, you can write a script in your language of choice and programmatically run tests against WebPageTest. The tests can take a little while to run, so your script will need to poll WebPageTest to check the status of the test until the test is complete. When the test is complete, you can iterate through the response and pull out each result. Figure 1-4 presents a high-level architecture of how this script might function.

Figure 1-4. High-level architecture of how the script might function
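The submit-and-poll flow described above can be sketched in a few lines of Python. This is only a sketch under stated assumptions: it assumes the `runtest.php` and `jsonResult.php` endpoints of the public WebPageTest instance and a JSON response carrying `statusCode` and `data.testId`, so check the current API documentation before relying on it. The `fetch` and `sleep` parameters are hypothetical hooks added here so that the polling logic can be exercised without a network.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumption: the public instance; a private instance would use its own host.
WPT_HOST = "https://www.webpagetest.org"


def fetch_json(url):
    """Fetch a URL and decode the JSON body."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def run_test(site_url, api_key, fetch=fetch_json,
             poll_seconds=10, max_polls=60, sleep=time.sleep):
    """Submit a test, poll until it completes, and return the result data."""
    query = urllib.parse.urlencode({"url": site_url, "k": api_key, "f": "json"})
    submitted = fetch(f"{WPT_HOST}/runtest.php?{query}")
    if submitted.get("statusCode") != 200:
        raise RuntimeError(f"test was not accepted: {submitted}")
    test_id = submitted["data"]["testId"]

    for _ in range(max_polls):
        result = fetch(f"{WPT_HOST}/jsonResult.php?test={test_id}")
        status = result.get("statusCode", 0)
        if status == 200:        # the test has finished
            return result["data"]
        if status >= 400:        # the test failed or was not found
            raise RuntimeError(f"test failed: {result}")
        sleep(poll_seconds)      # still queued or running; wait and retry
    raise TimeoutError(f"test {test_id} did not finish in time")
```

A CI job would call `run_test("https://example.com", api_key)` and hand the returned data to whatever charting or gating step comes next.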

After you have the result, you can do any number of things: you can create charts with those results with a tool such as Grafana (https://grafana.com), you can store them locally, and you can integrate these results into your CI software of choice (see Figure 1-5). Imagine failing a build because the changes introduced a level of latency that you deemed unacceptable and holding that build until the performance impact has been addressed!


Figure 1-5. Integrating your results into your CI software

Of course, as with anything that breaks the build, conversations will need to be had about whether the change is worth the impact, and whether to raise the accepted latency, even just temporarily, to get the feature out. Still, at least these conversations are happening before release, and the team isn’t trying to figure out after the fact which change affected performance and how to deal with it while your app is live in production.
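One way to fail a build on unacceptable latency is to compare the metrics from a test run against a performance budget and return a nonzero exit code when any metric exceeds it. The sketch below assumes the metrics have already been pulled out of the test result into a flat dictionary; the metric names and limits in `BUDGET` are hypothetical and would be agreed on by your team.

```python
import sys

# Hypothetical budget: metric name -> maximum acceptable value.
BUDGET = {"TTFB": 500, "SpeedIndex": 3000, "requests": 80}


def budget_violations(metrics, budget=BUDGET):
    """Return one message for every metric that exceeds its budget."""
    return [
        f"{name}: {metrics[name]} exceeds budget of {limit}"
        for name, limit in budget.items()
        if name in metrics and metrics[name] > limit
    ]


def check_build(metrics, budget=BUDGET):
    """Print violations to stderr and return a CI-friendly exit code."""
    problems = budget_violations(metrics, budget)
    for problem in problems:
        print(problem, file=sys.stderr)
    return 1 if problems else 0
```

In a CI step, `sys.exit(check_build(metrics))` would mark the build as failed whenever the budget is exceeded. Raising a limit in `BUDGET` is then an explicit, reviewable change, which is exactly the conversation described above.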

Use Log Introspection for Real User Monitoring

Real User Monitoring is the practice of recording and examining actual user interactions and deriving the performance data from these interactions. This is generally a more useful metric to report than synthetic results because the numbers reflect real experiences of your actual users.

So, synthetic testing is great during development, but how do you quantify your performance in production with real users? One answer is to use log introspection. Log introspection really just means harvesting your server and error logs into a tool such as Splunk or ELK (Elasticsearch, Logstash, and Kibana), which allows you to query and build monitors, alerts, and dashboards against this data, as illustrated in Figure 1-6.


Figure 1-6. Dashboards, monitors, and alerts against data

But, wait, you might be saying: aren’t logs primarily about monitoring system performance? True, logs are generated on the backend, and traditionally the data in these logs pertains to requests against the server. Although they’re hugely useful to gather HTTP response codes and ascertain fun things like vendor API Service-Level Agreement compliance, they do not naturally lend themselves to capturing client-side performance metrics.

Luckily, one of my friends and colleagues, John Riviello, has created an open source JavaScript library named Surf-N-Perf that captures these metrics on the client side and allows you to feed them to an endpoint you define. Some of these metrics are data points that allow you to see when a page has started to render, when it finishes rendering, or when the DOM is able to accept interaction. These are all key milestones to evaluate a user’s actual experience.

The library uses the performance metrics available on the window.performance object (documentation for which is available at https://w3c.github.io/perf-timing-primer/).

All you need to do after integrating Surf-N-Perf is create an endpoint that takes the request and writes it to your logs, as depicted in Figure 1-7. Preferably, the endpoint should be separate from the main application; this way, if the application is having issues, your logs can still properly record the data.


https://github.com/Comcast/Surf-N-Perf


Figure 1-7. Creating an endpoint that takes the request and writes it to your logs
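An endpoint of this kind can be very small. The following is a minimal sketch using only the Python standard library, assuming the client posts its metrics as a flat JSON object; the payload shape, the logger name, the log file, and the port are all assumptions made for illustration, not part of Surf-N-Perf itself.

```python
import json
import logging
from http.server import BaseHTTPRequestHandler, HTTPServer

logger = logging.getLogger("rum")  # hypothetical logger name


def to_log_line(body):
    """Turn a JSON object of metrics into a flat key=value log line."""
    metrics = json.loads(body)
    if not isinstance(metrics, dict):
        raise ValueError("expected a JSON object of metrics")
    return " ".join(f"{key}={metrics[key]}" for key in sorted(metrics))


class BeaconHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            logger.info(to_log_line(body))
            self.send_response(204)  # recorded; nothing to send back
        except ValueError:
            self.send_response(400)  # malformed beacon; do not log it
        self.end_headers()


def serve(port=8080, log_file="rum.log"):
    """Run the beacon endpoint as its own process, apart from the main app."""
    logging.basicConfig(filename=log_file, level=logging.INFO,
                        format="%(asctime)s %(message)s")
    HTTPServer(("", port), BeaconHandler).serve_forever()
```

Calling `serve()` starts the listener. The key=value format is chosen only because log tools such as Splunk or ELK can typically extract fields from it with little configuration.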

This allows you to see the real performance numbers for your customers. From there, you can slice the data into percentiles and say with confidence what the actual experience is for the vast majority of your users. This also allows you to begin diagnosing and fixing other issues that might not have been caught by your synthetic testing. We talk more about this in Chapter 3.
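To make the idea of slicing into percentiles concrete, here is a small sketch using the nearest-rank method, which is one of several common percentile definitions; your log tool may interpolate and report slightly different values. The sample load times are made up for illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up page load times in milliseconds.
load_times = [300, 420, 380, 950, 410, 2900, 390, 405, 360, 440]
median = percentile(load_times, 50)
p90 = percentile(load_times, 90)
p99 = percentile(load_times, 99)
```

With these samples the median looks healthy while the higher percentiles expose the slow outliers, which is why reporting only an average tends to hide the experience of your least fortunate users.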

Tell People!

Your application gets mostly straight A’s in the synthetic tests you run, any performance-affecting changes are caught and mitigated before ever getting to production, and your dashboards show subsecond load times for the 99th percentile of your actual users. Congratulations! Now what? Well, first you operationalize, which we talk about in Chapter 3, but after that you advertise your accomplishments!

Chances are, your peers and management won’t proactively notice, though your users and your business unit should. What do you do with all of these fantastic success stories you have accumulated along with the corroborating metrics? Share them.

Incorporate your performance metrics into your regular status updates, all hands, and team meetings. Challenge peer groups to improve their own metrics. Publish your dashboards across your organization. Better performance for everyone only benefits your customers and your company, so collaborate with your business unit to quantify how this improved performance has affected the bottom line and publish that.

Summary

In this chapter, we looked at using speed tests to keep up with the ever-changing landscape of client-side web performance best practices. For extra points, and to realistically scale this for an organization, we talked about automating these tests and integrating these automated tests into our CI environment.

We also talked about using our log introspection software, along with an open source library such as Surf-N-Perf, to capture our real user metrics around client-side performance. How we then diagnose and improve those numbers in production is the focus of Chapter 3.

CHAPTER 2

Accomplishing Web Performance Wins via Infrastructure

When looking holistically at your stack, there are some significant wins that you can gain through your infrastructure without implementing huge architectural changes. If you are not already doing so, using a Content Delivery Network (CDN) will show immediate and significant performance improvements, just as using an elastic cloud platform will allow you to scale on demand to prevent performance bottlenecks. Let’s take a look.

Using a CDN

A CDN is a globally distributed network for hosting and serving data. Although it is possible to set up your own private CDN, this requires setting up a lot of infrastructure. So, for the purposes of this book, we will discuss the commercial options available for the following:

• Edge caching
• Global traffic management

Edge Caching: Serving Your Application as Close as Possible to Your User

One of the biggest causes of latency on the server side is simply the proximity of your end users to the machines serving your application. It’s pure physics: data is transmitted as light down fiber-optic lines, but even at best it travels at roughly two-thirds of the speed of light in a vacuum, so the closer your visitors are to your server, the faster they receive the data. Many companies utilize multiple datacenters across the country (and around the world), so presumably, you might have content served from each coast, but what if you could serve content from the same state or even the same city? That’s the beauty of edge caching.

Most CDNs maintain an edge network, which is a network of nodes distributed across the country or world that can host your content. The idea of serving content from the edge is that the CDN can serve content from an edge node that is closer to a user than your datacenter is, and, as just stated, the closer the source of the content is to the end user, the faster it is received. When you cache your content at the edge, your response times are even faster, and you get the added bonus of reducing the amount of traffic going to your datacenter origins (thus requiring fewer nodes and having less to maintain, as I explain in my book Intelligent Caching [2017, O’Reilly]).

Cache is a mechanism to store HTTP responses for use in future HTTP requests to prevent the need to look up and retrieve that data again. When talking about web cache, the body of the HTTP response is indexed and retrieved by using a cache key, which in its most basic form is the HTTP method and URL of the request (enterprise-level CDNs typically offer more advanced cache-key customization if needed).
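The basic cache key described here, the HTTP method plus the URL, can be illustrated with a toy in-memory cache. This sketch deliberately ignores everything a real HTTP cache must also handle, such as Cache-Control headers, expiry, and invalidation; the `EdgeCache` class and the `origin` callable are hypothetical names used only for this example.

```python
class EdgeCache:
    """Toy response cache keyed by HTTP method and URL."""

    def __init__(self):
        self._responses = {}

    @staticmethod
    def cache_key(method, url):
        # The most basic cache key: the HTTP method plus the request URL.
        return (method.upper(), url)

    def fetch(self, method, url, origin):
        """Return the cached body, calling the origin only on a miss."""
        key = self.cache_key(method, url)
        if key not in self._responses:
            self._responses[key] = origin(method, url)
        return self._responses[key]
```

The second request for the same method and URL never reaches the origin, which is where both the speedup and the reduced origin traffic come from.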

Figure 2-1 breaks down the difference between HTTP cache, where the response is cached at the web server (or edge node), and browser cache, where the browser itself holds the cached response.

Figure 2-1. Showing the differences in flow from a traditional uncached HTTP request versus a request cached at the web server versus a response stored in browser cache

To make the most of edge caching, we need an architecture that takes advantage of it. If all of our logic is server-side and must be executed before the response can be returned from the server, we get very limited benefit from edge caching. Instead, we can get the most from edge caching if we have a highly cacheable base page that asynchronously loads in content from the client side.

Make Requests to the Fastest Possible Origin

In addition to proximity to the end user, another cause of latency on the server side might be unhealthy nodes. Maybe one of your datacenters has nodes that are unresponsive or take significantly longer to respond. This would make your site appear to be painfully slow if not down completely. Wouldn’t it be great to not just know that this is happening, but to be able to reroute incoming requests only to the healthiest and fastest datacenter? Most CDNs offer global traffic management (GTM) as a function to solve for this.

GTM is a feature of a CDN to balance traffic between datacenters. Generally, the GTM will use the following criteria for routing traffic:

Availability
    Is the datacenter available? This might be as simple as doing a curl or ping on a healthcheck file. If the file is found and returned with no errors, the datacenter is technically available.

Proximity
    Is this the closest datacenter to this user?

Performance
    Are the responses from the datacenter within an established service-level agreement (SLA) for performance, or is there a high percentage of 504 messages returned? Sometimes, the closest datacenter is not the fastest, for a variety of reasons.

Figure 2-2 offers an overview of how GTM works.

Figure 2-2. GTM routing traffic between datacenters
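The three routing criteria can be combined in many ways, and each CDN weighs them differently. As one hedged illustration, the sketch below drops unavailable datacenters, prefers those that meet a performance SLA, and then picks the closest of what remains. The `Datacenter` fields, the default thresholds, and the fallback rule are all assumptions made for this example, not the behavior of any particular GTM product.

```python
from dataclasses import dataclass


@dataclass
class Datacenter:
    name: str
    available: bool         # did the healthcheck succeed?
    distance_km: float      # rough proximity to the requesting user
    p95_response_ms: float  # recent 95th-percentile response time
    timeout_rate: float     # fraction of recent responses that were 504s


def choose_datacenter(datacenters, sla_ms=500, max_timeout_rate=0.05):
    """Pick the closest available datacenter that is within the SLA."""
    candidates = [dc for dc in datacenters if dc.available]
    if not candidates:
        raise LookupError("no datacenter is available")
    within_sla = [
        dc for dc in candidates
        if dc.p95_response_ms <= sla_ms and dc.timeout_rate <= max_timeout_rate
    ]
    # If nothing meets the SLA, fall back to any available datacenter.
    pool = within_sla or candidates
    return min(pool, key=lambda dc: dc.distance_km)
```

Note how the closest datacenter loses out when it is slow, which matches the observation above that the closest datacenter is not always the fastest.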

Using a Cloud Provider

One of the biggest reasons that a backend is not performant is that it is not scaled appropriately. Picture it: you are in the middle of your peak traffic time and your performance begins to slow to a crawl. People are calling in, complaints are stacking up, production incidents are being filed.

Maybe your servers are running out of capacity in their HTTP request pools, or your CPUs are red-lining, or memory is dwindling. Whatever the case, responses to incoming requests are beginning to slow down, and some requests are even being turned away completely.

You could release some of this pressure on your backend and speed things up if only you could request more nodes. But because you are one of a multitude of customers that your provider must serve, the operations department at your company has a five-business-day SLA for turning around new node requests. There is no help coming; all you can do is increase the heap size, increase your request pool, restart your nodes, and try to ride out this wave of traffic and deal with the implications afterward.

But it doesn’t need to be this way.

Had you used a cloud provider for your infrastructure, one that allows for elastic scaling of nodes, you would have been fine. There are a number of hosted commercial solutions, like Amazon Web Services (AWS) or Microsoft Azure, as well as on-premises options.

The basic architecture of a website running on a cloud platform should look very similar to what you might expect a traditional architecture to look like. There are application nodes running in availability zones (think datacenters in more classical architecture) all running on a cloud platform. In front of all of this, there is a load balancer routing incoming traffic, as depicted in Figure 2-3.

Figure 2-3. A basic cloud infrastructure is not so different from a classic infrastructure

You can set up all of this and run it either from a GUI or automate it via the command line.

Automate Scaling to Accommodate Spikes in Traffic

This is fantastic on its own, but most cloud providers also provide elastic scaling capabilities. With elastic scaling enabled, the cloud platform will spin up new nodes to accommodate increased load, and then spin them down when they are no longer needed.

Each cloud provider is different, but this is generally achieved by establishing elastic scaling groups that define the set of nodes to increase and decrease as needed. The scaling functionality monitors these nodes and expands or contracts the group based on the criteria and thresholds that you define. Figure 2-4 provides a visual example.
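The threshold logic behind a scaling group can be sketched as a single decision function. This is an illustration of the idea only, not any provider’s API: the choice of average CPU as the metric, the one-node step, and every default value below are assumptions.

```python
def desired_node_count(current, avg_cpu_percent, min_nodes=2, max_nodes=10,
                       scale_out_above=70, scale_in_below=30):
    """Return how many nodes the group should have after one evaluation."""
    if avg_cpu_percent > scale_out_above:
        target = current + 1   # under pressure: add a node
    elif avg_cpu_percent < scale_in_below:
        target = current - 1   # mostly idle: remove a node
    else:
        target = current       # within the comfortable band
    # Never leave the bounds configured for the group.
    return max(min_nodes, min(max_nodes, target))
```

The gap between the two thresholds matters: if they sit too close together, the group can flap between adding and removing nodes on every evaluation.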


1 Amazon S3 experienced a large-scale outage on February 28, 2017 that affected all users connected to the AWS US-EAST-1 Region. For more information on this, go to https://aws.amazon.com/message/41926/

Figure 2-4. Availability zone

As the world learned in the great AWS Outage of 2017,1 even cloud providers go down. When AWS went down, sites that relied solely on the availability zone that failed experienced severe outages themselves.

When using a cloud provider, at a minimum, you should use several availability zones and regions to minimize the impact of downtime in any one availability zone.

An even better idea is to take advantage of the functionality of a GTM from your CDN to route traffic between several different cloud providers to maximize potential uptime.

Summary

This chapter looked at infrastructural performance optimizations that you can implement. We looked at some of the easy wins that you can derive from using a CDN, serving cached content at the edge, and routing to the best possible origin.

We talked about using a cloud service provider to create an infrastructure that could expand and contract as needed to avoid performance-killing bottlenecks.

In Chapter 3, we look at tools to help operationalize our performance.

CHAPTER 3

Operationalize Performance

Your site is out in production; performance is where you want it; everything looks great.

You think.

But how do you quantify what the actual experience is out in the wild? How are your machines performing with real users, who use their own devices connected via various networks, each of varied quality? Even more important, how do you identify, triage, and debug an issue in production that is affecting actual customers?

You use an application performance management (APM) tool. There are many such tools on the market; some of the more popular are New Relic, AppDynamics, and Dynatrace.

Setting Up an APM

Setting up an APM is relatively painless. Generally, you just install an APM agent on the machines that you want monitored. The agents capture metrics for the machines on which they are installed and communicate those to the hosted APM platform. The APM platform processes the data and makes it available via dashboards. Figure 3-1 presents a diagram of that architecture.

http://bit.ly/2sseejR

https://www.appdynamics.com/

https://www.dynatrace.com/

Figure 3-1. The APM platform processes the data and makes it available via dashboards

Using an APM to Troubleshoot Performance Issues

Picture this: you are sitting at your desk when you get a call from one of your stakeholders. They are hearing customer complaints; users are trying to access your site but are experiencing a lot of latency.

Luckily, your site is already being monitored by an APM, so you just fire up your dashboard and look for the time period in question.

Figure 3-2 shows an example of a dashboard from New Relic.

Figure 3-2. Dashboard from New Relic


The subsections that follow discuss some of the key metrics that our APM should expose.

Throughput

Throughput is a measurement of requests over time, usually either per second or per minute. This lets you see if traffic has suddenly dropped off or spiked. Even more useful, we can measure throughput by node to validate traffic shaping or identify potentially unhealthy nodes. In our example, if there were a widespread issue, we would probably see throughput dropping as users abandon our site. Or, at a granular level, there could be nodes that are receiving more requests than others, causing requests to slow down. The per-node throughput view is where we would see this.
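An APM computes this for you, but the calculation itself is simple enough to sketch. Assuming each request is recorded as a (node, timestamp) pair, with the timestamp in epoch seconds, per-node throughput per minute is a count grouped by node and minute; the record shape is an assumption made for this example.

```python
from collections import Counter


def requests_per_minute(entries):
    """Count requests per (node, start-of-minute) bucket.

    entries is an iterable of (node_name, epoch_seconds) pairs.
    """
    counts = Counter()
    for node, timestamp in entries:
        minute_start = int(timestamp) // 60 * 60
        counts[(node, minute_start)] += 1
    return counts
```

Comparing the counts for different nodes within the same minute shows whether the load balancer is spreading traffic evenly or whether one node is absorbing more than its share.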

Errors

The APM agent will track application errors that can lead to bad HTTP responses and, from those, derive an error rate for your application. Error rate is the number of failed requests divided by the total number of requests. Figure 3-3 depicts the error screen.

Figure 3-3. Error rate

With knowledge of your error rate, you can craft alerts based on thresholds around this rate to know whether issues are ticking up in production and if you are on the verge of an incident occurring.
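As a small worked example, the error rate and a threshold alert built on it could look like the following. The 2 percent default threshold is an arbitrary value chosen for illustration; an appropriate threshold depends on your application’s normal baseline.

```python
def error_rate(failed, total):
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return failed / total if total else 0.0


def should_alert(failed, total, threshold=0.02):
    """True when the error rate has reached the alerting threshold."""
    return error_rate(failed, total) >= threshold
```

For instance, 5 failures out of 200 requests is an error rate of 0.025, or 2.5 percent, which would trip the alert at the default threshold.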

Even better, if your APM supports the language and runtime that you are using, you also will be able to dig deeper into those errors and get stack traces from the actual function or classes that threw the error to help debug and fix it.

Figure 3-4 demonstrates a stack trace of an error in New Relic.

This is an important point: do not assume that support for what you need is there. Some of my teams had to use JRuby for quite some time because the APM that our company had a contract with did not support Ruby, only Java or Microsoft .NET, so we used JRuby to at least get some metrics from the Java Virtual Machine.

Figure 3-4. Stack trace of an error

For our example, if the latency that our users are experiencing is due to errors or is causing HTTP 504 messages, it would be evident here. At the very least, we could see which requests are timing out; ideally, if there are actual errors, we could begin to debug from the stack trace.

Most Expensive Transactions

We can look here to see what, if any, transactions are suddenly taking much longer to respond. If a backend service is having issues and our calls to it are timing out or just taking much longer to resolve, this is where the problem would become evident.


Even if we currently are not in fire-drill mode, we can use this feature to proactively drive to better overall performance. If we know what transactions take the most time, we can focus on those and try to lower their performance cost.
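Ranking transactions by cost amounts to summing the time spent in each one and sorting. The sketch below assumes timing samples arrive as (transaction name, duration in milliseconds) pairs and ranks by total time consumed; an APM may instead rank by average or percentile duration, so treat this as one possible definition of "most expensive."

```python
from collections import defaultdict


def most_expensive(samples, top=3):
    """Rank transactions by total time consumed, highest first.

    samples is an iterable of (transaction_name, duration_ms) pairs.
    """
    totals = defaultdict(float)
    for name, duration_ms in samples:
        totals[name] += duration_ms
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]
```

Ranking by total time favors transactions that are both slow and frequent, which is usually where optimization effort pays off first.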

Node Health

If you have too much traffic on a single node, you will likely see certain things happening to that node. You could see your CPU usage spiking as the machine struggles to keep up. You could see your memory usage running high. You could see HTTP requests begin to be turned away, causing the node’s throughput to drop.

All of these will make your user experience slow to a crawl and eventually just error out.

The APM agent will track the health of the machine that it is installed on and include those metrics in the dashboard. Figure 3-5 shows an example of node health in a New Relic dashboard.


Figure 3-5. Node health in the New Relic dashboard

Third-Party Service-Level Agreements

If you have done your due diligence but are still experiencing high page load times, more often than not the root cause of the slowdown will be slow responses from the third-party or partner services that you call. Maybe there is an API from another internal team that you call to load user data, or there are parts of your site dependent on an API to process user input.

An APM tool allows you to track the response times of your external APIs so that you can sort them by most time-consuming. Figure 3-6 gives an example of third-party API tracking.


Figure 3-6. Third-party API tracking

Having that kind of data allows you to follow up with these partners and have conversations about their own performance. Ideally, they would have a previously agreed-upon service-level agreement (SLA) governing the performance of their API.
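To ground that conversation in numbers, you could compare observed response times against the agreed limits. The following is a minimal sketch that assumes the SLA is expressed as a 95th-percentile response time in milliseconds; the API names, sample timings, and limits are all hypothetical.

```python
import math


def percentile(values, pct):
    """Return the nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def sla_breaches(response_times_ms, sla_ms, pct=95):
    """Return {api_name: observed_percentile} for APIs slower than their SLA."""
    breaches = {}
    for api, times in response_times_ms.items():
        observed = percentile(times, pct)
        if observed > sla_ms[api]:
            breaches[api] = observed
    return breaches


# Hypothetical timings per external API, in milliseconds.
response_times_ms = {
    "user-profile": [110, 120, 130, 140, 900],
    "payments": [200, 210, 220, 230, 240],
}
# Hypothetical agreed limits, in milliseconds.
sla_ms = {"user-profile": 300, "payments": 500}
print(sla_breaches(response_times_ms, sla_ms))
```

A percentile is used here rather than an average because a handful of very slow calls can hide behind a healthy-looking mean.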

Summary

Using an APM is critical not just for maintaining full-stack performance, but for debugging performance issues in production as well. Most important, between APMs and cloud providers, development teams are being empowered to take operational ownership of their products.


CHAPTER 4

Next Steps

We’ve talked at a high level about what sorts of performance wins you can achieve when taking a holistic, full-stack view of your web applications. So, what are some tactical next steps you can take?

Get Synthetic Web Performance Results

As we talked about in Chapter 1, run your site through a web performance test like WebPageTest (https://www.webpagetest.org/). It is free, gives you an idea of your current standing, and gives you steps to take to remediate the issues found.

Here are some other web performance tests:

• YSlow, available at http://yslow.org/

• Pingdom, available at https://tools.pingdom.com/

• Google’s PageSpeed Insights, available at https://developers.google.com/speed/pagespeed/insights/

Trial a CDN for Free

We talked at length about some of the benefits we can get from using a CDN, but signing up with a CDN can be daunting and will involve cost. An easy first step on that path is to set up a free trial account with a CDN. Most of the popular CDNs have a free trial available:

• Akamai, available at https://www.akamai.com/us/en/campaign/get-akamaized.jsp

• Incapsula, available at https://www.incapsula.com/pricing-and-plans.html

• MaxCDN, available at https://www.maxcdn.com/test-account/

With your free trial set up, benchmark your site behind a CDN and compare the performance numbers to your current setup. If you are feeling really daring, point some traffic to it and see what benefits you can get infrastructurally: are your machines less taxed, and how many fewer nodes would you need to maintain? Can you quantify these benefits and use them to justify the budget request to actually take the plunge?
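One rough way to quantify both questions is sketched below. It assumes you have collected page load times from repeated benchmark runs with and without the CDN, and that you know your peak request rate, the share of requests the CDN serves from its cache, and how many requests per second one node can handle. Every number and name here is made up for illustration.

```python
import math
from statistics import median


def improvement(baseline_ms, cdn_ms):
    """Fractional reduction in median load time when served through the CDN."""
    before, after = median(baseline_ms), median(cdn_ms)
    return (before - after) / before


def nodes_needed(peak_rps, offload_ratio, rps_per_node):
    """Nodes required for the origin traffic left after the CDN absorbs its share."""
    origin_rps = peak_rps * (1 - offload_ratio)
    return max(1, math.ceil(origin_rps / rps_per_node))


# Hypothetical page load times in milliseconds from repeated benchmark runs.
baseline_ms = [2400, 2600, 2500]
cdn_ms = [1500, 1400, 1600]
print(f"median load time reduced by {improvement(baseline_ms, cdn_ms):.0%}")

# Hypothetical capacity figures: 1,000 requests/s at peak, 100 requests/s per node,
# and a CDN that serves half of all requests from its cache.
print("nodes without CDN:", nodes_needed(1000, 0.0, 100))
print("nodes with CDN:", nodes_needed(1000, 0.5, 100))
```

This treats capacity as purely proportional to request rate, which is a simplification; real sizing should also leave headroom for failover and traffic spikes.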

Trial an Application Performance Management Tool for Free

Just like CDNs, application performance management (APM) companies are more than happy to give you a free trial to test out their products. Here are some of the notable ones:

• New Relic, available at https://newrelic.com/signup

• Dynatrace, available at https://www.dynatrace.com/trial/

• AppDynamics, available at https://www.appdynamics.com/free-trial/

Install an agent and check out the dashboards for your application. Most APMs are so feature-rich that you’ll most likely find that the company you are trialing is happy to walk you through its feature set. Some even offer extensive training. In the past, I have even had a company representative offer to help me debug a production issue by way of a capabilities demo.

Embrace Full-Stack Development and DevOps

The most important next step of all is to embrace the idea of full-stack development and DevOps.

I can still remember the days of needing to reach out to an operations team when something would go wrong in production because only they could get me a snippet of the logs. And it would be a flat file that I would need to grep through to search for things that I had learned to look for, things like specific error codes or HTTP responses.

I remember needing to factor hardware into my budget at the beginning of the year for projects that had not yet been scoped or even envisioned, and then waiting months for machines to be ordered, shipped, and set up at the datacenter. And, if I had guessed wrong, how was I going to scale up in time to meet the demand?

The advances of platform and infrastructure as a service have brought the power of operations on demand, if we just embrace it with open arms.


About the Author

Tom Barker is a software engineer, engineering manager, professor, and author. Currently, he is director of Software Engineering and Development at Comcast, and an adjunct professor at Philadelphia University.
