Expecto Performa! The magic and reality of performance tuning
TRANSCRIPT
MATT SHELTON | TECHNICAL ACCOUNT MANAGER | ATLASSIAN | @MATTSHELTON
DENISE UNTERWURZACHER | SITE RELIABILITY ENGINEER | ATLASSIAN
Agenda
Understanding the problem
Let’s agree… to agree
Measure (allthethings)
Benchmarking
The art of tuning
My users say that JIRA takes too long to create an issue, and that opening their boards takes forever!
JIRA ADMIN
We just acquired this other company, they have Confluence as well and we want to merge them together. We’ll end up with 25,000 users and it’s not the fastest now, and then we want to start using Collaborative Editing, and open it up to outside contractors, and …
CONFLUENCE ADMIN
Agenda
Understanding the problem
Let’s agree… to agree
Measure (allthethings)
Benchmarking
The art of tuning
Expectations
Status Quo Is it ok today? It’s only a few seconds…
Be reasonable… Don’t compare your internal Jira instance to a supercomputer!
Latency A little is ok, but a lot can be a big problem.
User Behavior temet nosce (know thyself)
Scalability I’ll see your 250 users, and raise you 2,500, then 25,000…
Priorities!
Urgency Trying to fix what’s broken, or make it better?
Who cares? Discover, discern, and prioritize!
Now vs Later What’s most beneficial now? What would be helpful down the road?
Staffing Levels
Customer 1
Profile - 10,000 users - Jira Data Center, Confluence Data Center - 1 Manager - 2 FT Sys Admins - 3 FT App Admins - 1 FT Dev - 1 FT SRE - 1 Architect
Assessment Well-staffed. Team runs all of our applications as well as other developer tools.
Customer 2
Profile - 20,000 users - 2 Jira Data Center, 2 Confluence Data Center, 1 Bitbucket Data Center, FeCru, 3 Bamboo - 1 Team Lead - 2 FT/1 PT App Admins - 1 FT/1 PT Sys Admins
Assessment Under-staffed. Team runs all of our applications as well as at least 5 other tools across different geographies.
Agenda
Understanding the problem
Let’s agree… to agree
Measure (allthethings)
Benchmarking
The art of tuning
It depends…
Bitbucket Make sure you monitor the CPU! (But don’t forget about DB load, SCM jobs, or disk speed, or…)
Bamboo Make sure you monitor the network! (But don’t forget about CPU load, or page load time, or…)
Jira Make sure you monitor disk I/O! (But don’t forget about heap use, or CPU load, or page load time, or…)
Confluence Make sure you monitor memory! (But don’t forget about the DB connection pool, or disk I/O, or…)
Agenda
Understanding the problem
Let’s agree… to agree
Measure (allthethings)
Benchmarking
The art of tuning
Extranet
Garbage Collection
Long pauses Garbage collection pauses of 10–20s
Nodes removed Lack of response means nodes are aggressively removed from the pool
Network latency
Photo heavy blogs Latency causing slow image downloads
Load balancers
10s health check Nodes are removed from the pool if there is no response for 10s
Idempotency Failed requests replay across other nodes
No draining period Requests fail immediately
Know your infrastructure
Network
Largest potential for problems Work with your networking teams
Access logging Pipe load balancer or Tomcat access logs into Splunk
Database
Latency Check in System Information, or ping
Slow queries Enable logging on the database
App servers
CPU Datadog or LogicMonitor
I/O Datadog or LogicMonitor
Memory Enable GC logging, use GCViewer
Requests
HTTP threads New Relic
Load balancer Access logging
Database connections Datadog
Find your benchmarks
Peak times Know your peak and low load times
Percentiles Averages mean nothing
Identify Establish baselines for each area
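“Averages mean nothing” is easy to demonstrate with a few lines of Python. In this hypothetical latency sample, five pathological requests leave the median untouched while dragging the mean well away from what a typical user experiences; only the high percentiles expose the tail.

```python
import statistics

# Hypothetical response times in ms: 95 fast requests, 5 pathological ones.
samples = [120] * 95 + [4000] * 5

mean = statistics.mean(samples)
q = statistics.quantiles(samples, n=100)   # 99 percentile cut points
p50, p95, p99 = q[49], q[94], q[98]

print(f"mean={mean:.0f}ms  p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
```

Here the mean (~314 ms) describes no real request: the typical request is p50 = 120 ms, while the tail sits at p99 = 4000 ms. Benchmark against percentiles, not the average.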
Agenda
Understanding the problem
Let’s agree… to agree
Measure (allthethings)
Benchmarking
The art of tuning
Go slow
Be patient Wait for peak and low load times
Know when to stop Is that last 100ms really worth it?
Isolate Make one change at a time and benchmark
Magic tricks
CPU Add more cores, limit concurrency
Garbage Collection Adding more memory != success
Threading More database connections than HTTP threads
Load balancer
• Increase buffers
• Turn off idempotency
• Allow draining
• ‘Least connections’ over ‘round robin’
Jira
Limit complexity
• Limit or combine custom fields
• Clean up unused plugins
• Keep general complexity of workflows low
Confluence
Tune for fault tolerance
• Extend the cluster safety interval
• Turn off idempotency at your load balancer
• Know your garbage collection behaviour
Bitbucket Server
Load comes from git
• Optimise for git over the JVM
• Scale vertically over horizontally
• Use Docker and mirrors
Planning for the future
Capacity planning
HTTP threads Requests in the highest-load minute / 60 × average time to complete = threads in use
Requests in highest load minute = 8400, average time to complete = 0.82s
8400 / 60 × 0.82 ≈ 115 threads, or 29 per node
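That back-of-the-envelope calculation can be sketched directly. The 8,400-request and 0.82 s figures are from the slide; the four-node cluster is an assumption made only to reproduce the 29-threads-per-node number.

```python
# Little's-Law-style estimate of concurrent HTTP threads:
# (requests in the busiest minute / 60) gives requests per second;
# multiplying by the average completion time gives threads busy at once.

def threads_in_use(requests_per_peak_minute: float, avg_seconds: float) -> float:
    return requests_per_peak_minute / 60 * avg_seconds

total = threads_in_use(8400, 0.82)   # figures from the slide
per_node = total / 4                 # 4 nodes is an assumption
print(f"{total:.0f} threads total, ~{per_node:.0f} per node")
# → "115 threads total, ~29 per node"
```

Size your Tomcat thread pool comfortably above this estimate so a modest traffic spike doesn't exhaust it.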
Alerting
Our alerts
• More than 500 HTTP 500 errors in a minute
• More than 300 timeouts at the load balancer in an hour
• Garbage collection pauses > 10s
• Nodes being removed/re-added at the load balancer
• Cluster panics
• Out of memory errors
• Long running space exports
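As an illustration of the first alert, here is a minimal Python sketch that buckets access-log-style lines per minute and flags minutes with too many HTTP 500s. The log line format, the field layout, and the helper name are assumptions for the example, not Atlassian's actual tooling; in practice this kind of check would run against the load balancer access logs piped into Splunk.

```python
from collections import Counter

THRESHOLD = 500  # more than 500 HTTP 500s in one minute triggers the alert

def minutes_breaching(log_lines):
    """Return the minute buckets that exceed THRESHOLD 500-status responses.

    Each line is assumed to look like: '2017-09-12T10:04:31 GET /path 500'.
    """
    per_minute = Counter()
    for line in log_lines:
        timestamp, _method, _path, status = line.split()
        if status == "500":
            per_minute[timestamp[:16]] += 1  # bucket by YYYY-MM-DDTHH:MM
    return [minute for minute, count in per_minute.items() if count > THRESHOLD]
```

The same bucket-count-compare shape works for the timeout and node-removal alerts, just with a different match condition and window.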
Accept Your Reality There are limits to performance tuning. Be ok with what’s fast enough.
Data is Your Friend …but having good data takes time. Move slowly and methodically.
Chill Go slowly. Track Everything. Lather. Rinse. Repeat. (Always repeat.)