Distributed tracing - get a grasp on your production
@nklmish
“the most wanted and missed tool in the microservice world”
Agenda
Why latency?
Distributed tracing
Short demo
Zipkin & core concepts
Code walkthrough
Latency
Every little bit counts
With scale, you see
(source: https://gist.github.com/hellerbarde/2843375)
Latency?
User waiting
Remember, slow pages lose users
Distributed systems - latency analysis
Story time: how Bob met long-tail latency
Bob didn’t know he was suffering from long-tail latency
Bob trying to troubleshoot long-tail latency in a distributed system
Option 1: Log Analysis
Lots of files
Looking in logs
Not everything is on the critical path.
Correlating logs is manual work
It simply doesn’t make sense
Option 2: What about Metrics?
(source: https://gist.github.com/hellerbarde/2843375)
Something is wrong
(source: https://gist.github.com/hellerbarde/2843375)
Can’t tell the cause
(source: https://gist.github.com/hellerbarde/2843375)
Aggregates (avg, stdev) may deceive
(source: https://gist.github.com/hellerbarde/2843375)
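A minimal sketch of why an average can hide a slow tail. The latency values are made up for illustration, and the percentile helper uses the simple nearest-rank method rather than any particular monitoring tool's definition.

```python
import statistics

# Hypothetical sample: 98 fast requests (10 ms) and 2 slow ones (1000 ms).
latencies_ms = [10.0] * 98 + [1000.0] * 2


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


mean_ms = statistics.mean(latencies_ms)    # looks harmless
median_ms = percentile(latencies_ms, 50)   # typical request
p99_ms = percentile(latencies_ms, 99)      # what the unlucky requests see

print(f"mean={mean_ms:.1f} ms, median={median_ms} ms, p99={p99_ms} ms")
```

The mean stays close to the fast requests, while the 99th percentile exposes the slow ones.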
Bob, could we find out how many clients are impacted?
Bob learns about percentiles
Clients impacted by long-tail latency…
N: number of visits (500,000)
D: delay (50 ms)
S: highly active service (suffering from long-tail latency)
Percentile: 99th => 1 out of 100 visits experiences D
Total visits experiencing delay: N ÷ 100 => 5,000
1 visit (in our distributed system): 8 downstream calls => interacting with S (99% fast & 1% slow)
1 visit encountering latency: 1 - (0.99^8) = 1 - 0.923 => 0.077 ≈ 8% (likelihood)
Total visits affected: 8% of N => 40,000
Impacts: a. Lots of visits b. Repeated visits in a day
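The arithmetic on this slide can be checked with a short sketch. The numbers (500,000 visits, 8 downstream calls, 99% fast) come from the slide; the function and variable names are only illustrative, and the calls are assumed to be independent.

```python
N = 500_000            # visits (from the slide)
CALLS_PER_VISIT = 8    # downstream calls to service S per visit
P_FAST = 0.99          # share of calls to S that are fast


def chance_of_slow_visit(p_fast, calls):
    """Probability that at least one of `calls` independent calls is slow."""
    return 1 - p_fast ** calls


p_slow = chance_of_slow_visit(P_FAST, CALLS_PER_VISIT)

naive_estimate = N // 100                  # "1 in 100" view: one call per visit
affected_unrounded = round(N * p_slow)     # using the unrounded probability
affected_rounded = N * 8 // 100            # the slide's rounded 8% figure

print(f"chance a visit is slow: {p_slow:.3f}")
print(f"naive: {naive_estimate}, unrounded: {affected_unrounded}, at 8%: {affected_rounded}")
```

The slide rounds the probability up to 8%, which is why its total is slightly higher than the unrounded figure.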
Boss needs a solution
But we still don’t know…
Request timeline (When it started & which operation)
Log correlation
How the same operation behaved across different clusters/regions/zones.
How much it deviates from the acceptable value.
Call graph
Bob was missing Distributed Tracing
Distributed tracing
Tracks request flow.
Fast reaction (traced data available within minutes)
Dynamically instruments apps.
System insight, critical path, understanding call graphs (which services, which operations, at what time, etc.)
Measuring E2E latency
Call patterns (optimisation) & bug discovery (spotting redundant requests, sync vs async)
How can we apply this knowledge?
Via a tracing system
Tracing system should:
Trace
Have low overhead
Be scalable
Work 24/7, 365 days a year (production bugs are difficult to reproduce)
Shouldn’t:
Rely on programmers’ collaboration
OpenZipkin - open-source tracing system
OpenZipkin
Zipkin is:
Distributed tracing system
Created by Twitter
Based on Dapper.
OpenZipkin:
GitHub organisation
Primary fork of Zipkin
Open source
Pluggable architecture
Span
Denotes a logical unit of work done (timestamped)
Work done is expressed as a human-readable string (operation name)
Created by tracer (instrumenting code)
Slim (KiB or less)
Root span - span without parent id
Zipkin annotations
Client / Server
cs (client send): HTTP Request: get catalog (span starts)
sr (server received)
ss (server send)
cr (client received): HTTP Response: catalog (span ends)
Processing time = ss - sr
Response time = cr - cs
Network latency (request) = sr - cs
Network latency (response) = cr - ss
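The timing formulas on this slide can be written down directly. The sketch below assumes all four timestamps are in microseconds and that client and server clocks agree; the class name and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RpcAnnotations:
    cs: int  # client send
    sr: int  # server received
    ss: int  # server send
    cr: int  # client received

    def response_time(self):
        """Time the client waited: cr - cs."""
        return self.cr - self.cs

    def processing_time(self):
        """Time spent inside the server: ss - sr."""
        return self.ss - self.sr

    def request_network_latency(self):
        """Time on the wire for the request: sr - cs."""
        return self.sr - self.cs

    def response_network_latency(self):
        """Time on the wire for the response: cr - ss."""
        return self.cr - self.ss


# Hypothetical "get catalog" call, timestamps in microseconds.
call = RpcAnnotations(cs=0, sr=20, ss=120, cr=150)
print(call.response_time(), call.processing_time(),
      call.request_network_latency(), call.response_network_latency())
```

Response time always equals processing time plus the two network legs, which is a handy sanity check on recorded annotations.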
It’s all about trace & span
HTTP Request: get catalog
CatalogService: getCatalog() (traceId: 1, parentId: , spanId: 1)
  PriceService: getPrice() (traceId: 1, parentId: 1, spanId: 2)
  ProductService: getProducts() (traceId: 1, parentId: 1, spanId: 3)
    Database call (traceId: 1, parentId: 3, spanId: 4)
    Data analytic call (traceId: 1, parentId: 3, spanId: 5)
(Legend: span, trace)
Trace (E2E latency graph)
DAG of spans, forms latency tree.
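A minimal sketch of how spans that share a trace id can be assembled into a tree via their parent ids. The span ids and parent ids follow the catalog example in the slides; the tuple layout and function names are assumptions made for illustration, not Zipkin's data model.

```python
from collections import defaultdict

# (span_id, parent_id, operation); parent_id None marks the root span.
SPANS = [
    (1, None, "getCatalog"),
    (2, 1, "getPrice"),
    (3, 1, "getProducts"),
    (4, 3, "database call"),
    (5, 3, "data analytic call"),
]


def build_children(spans):
    """Map each parent id to the ids of its child spans, in input order."""
    children = defaultdict(list)
    for span_id, parent_id, _operation in spans:
        children[parent_id].append(span_id)
    return children


def render(spans):
    """Return the trace as indented lines, one per span, depth-first."""
    names = {span_id: operation for span_id, _parent, operation in spans}
    children = build_children(spans)
    lines = []

    def walk(span_id, depth):
        lines.append("  " * depth + names[span_id])
        for child in children.get(span_id, []):
            walk(child, depth + 1)

    for root in children.get(None, []):
        walk(root, 0)
    return lines


TREE = render(SPANS)
print("\n".join(TREE))
```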
Demo
https://github.com/nklmish/java-distributed-tracing-demo
https://github.com/nklmish/go-distributed-tracing-demo
Demo application - Zipkin visualises dependencies
@nklmish
Zipkin’s architecture
Service (instrumented): collects & converts spans
Transport: Scribe/Kafka
Collector: receives spans; deserialising, sampling & scheduling for storage
Storage (DB): stores spans; Cassandra/MySQL/Elasticsearch
API: retrieves data
UI: visualises
Tags
Tag denotes:
key-value pair
Not timestamped
A span may contain zero or more tags
Log
Log denotes:
Event name (marks a meaningful moment in the lifetime of a span)
Timestamped
A span may contain zero or more logs
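To make the difference between tags and logs concrete, here is a small sketch of a span that carries both. The class is a hypothetical illustration, not the API of Zipkin or any tracer; the tag value and the event names are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    operation: str
    tags: dict = field(default_factory=dict)   # key-value pairs, no timestamp
    logs: list = field(default_factory=list)   # (timestamp, event name) pairs

    def set_tag(self, key, value):
        """Attach context to the whole span; setting a key again overwrites it."""
        self.tags[key] = value

    def log(self, timestamp_us, event):
        """Record a meaningful moment in the span's lifetime."""
        self.logs.append((timestamp_us, event))


span = Span("getCatalog")
span.set_tag("http.path", "/catalog")   # hypothetical path
span.log(1_000, "cache.miss")           # hypothetical events
span.log(1_450, "db.query.done")

print(span.tags)
print(span.logs)
```

Tags describe the span as a whole and suit querying; logs are ordered in time and suit explaining where the latency went.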
Annotations
Help explain latency with a timestamp.
Annotations are often codes, e.g. sr, cs, etc.
Binary Annotations
Tags a span with context, usually to support query or aggregation. (e.g. http.path)
Repeatable, and can vary by host.
Can I have large spans (e.g. MiB)?
Decreases usability & increases the cost of the tracing system
Beware of clock skew!!!
10:00 10:00
10:00:01 10:00:22
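A sketch of how clock skew distorts the annotation arithmetic, and one simple way to estimate it. The heuristic assumes the request and response take equally long on the network, which is an assumption of this example rather than a description of what Zipkin does; the 21-second offset mirrors the two clocks on the slide, and the other numbers are invented.

```python
def estimate_skew(cs, sr, ss, cr):
    """Estimate the server-minus-client clock offset.

    Assumes symmetric network latency: the time not spent in the server
    is split evenly between the request and the response.
    """
    client_duration = cr - cs
    server_duration = ss - sr
    one_way = (client_duration - server_duration) / 2
    return sr - (cs + one_way)


# Hypothetical call in milliseconds: 10 ms each way, 100 ms of processing,
# with the server clock running 21 seconds ahead of the client clock.
SKEW_MS = 21_000
cs, cr = 0, 120
sr, ss = 10 + SKEW_MS, 110 + SKEW_MS

naive_request_latency = sr - cs            # wildly wrong: includes the skew
skew = estimate_skew(cs, sr, ss, cr)
corrected_request_latency = (sr - skew) - cs

print(naive_request_latency, skew, corrected_request_latency)
```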
Tracer
Does most of the heavy lifting e.g. span creation, context generation, passing info, data propagation, etc.
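A rough sketch of the "passing info" part of a tracer's job: creating a child context and carrying it across an HTTP call in headers. The `X-B3-*` header names are the B3 propagation headers used by Zipkin; the class, the function names and the default applied when `X-B3-Sampled` is missing are simplifying assumptions for this example.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TraceContext:
    trace_id: str
    span_id: str
    parent_id: Optional[str] = None   # None for the root span
    sampled: bool = True


def new_id() -> str:
    """Random 64-bit id encoded as 16 lower-case hex characters."""
    return secrets.token_hex(8)


def start_trace() -> TraceContext:
    return TraceContext(trace_id=new_id(), span_id=new_id())


def child_of(parent: TraceContext) -> TraceContext:
    """Same trace, new span id, parent id pointing at the caller's span."""
    return TraceContext(parent.trace_id, new_id(), parent.span_id, parent.sampled)


def inject(ctx: TraceContext) -> dict:
    """Turn a context into outgoing HTTP headers."""
    headers = {
        "X-B3-TraceId": ctx.trace_id,
        "X-B3-SpanId": ctx.span_id,
        "X-B3-Sampled": "1" if ctx.sampled else "0",
    }
    if ctx.parent_id is not None:
        headers["X-B3-ParentSpanId"] = ctx.parent_id
    return headers


def extract(headers: dict) -> TraceContext:
    """Rebuild the context on the receiving side (simplified: treats a missing sampled flag as sampled)."""
    return TraceContext(
        headers["X-B3-TraceId"],
        headers["X-B3-SpanId"],
        headers.get("X-B3-ParentSpanId"),
        headers.get("X-B3-Sampled", "1") == "1",
    )


root = start_trace()
child = child_of(root)
received = extract(inject(child))
```

In practice an instrumentation library does this for you, so application code rarely touches the headers directly.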
Sampling
Controls how much to record
High-traffic systems: a fraction of the traffic is enough
Low-traffic systems: adjust based on your needs
Note: Debug spans are always recorded.
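A minimal sketch of a probability-based sampler that follows the rules on this slide: record a fraction of the traffic, but always record debug spans. The class and parameter names are hypothetical and not taken from Zipkin or Sleuth.

```python
import random


class ProbabilitySampler:
    """Decides whether a new trace should be recorded."""

    def __init__(self, rate, rng=None):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0.0 and 1.0")
        self.rate = rate
        self.rng = rng or random.Random()

    def should_record(self, debug=False):
        # Debug requests bypass sampling entirely.
        if debug:
            return True
        # random() is in [0.0, 1.0), so rate 0.0 never records and 1.0 always does.
        return self.rng.random() < self.rate


# High-traffic system: keep roughly 10% of traces (seeded here for repeatability).
sampler = ProbabilitySampler(0.1, rng=random.Random(42))
recorded = sum(sampler.should_record() for _ in range(10_000))
print(f"recorded {recorded} of 10000 traces")
```

The decision is made once, when the trace starts, and then travels with the trace so that every service records or skips the same requests.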
OpenTracing
Standardises tracing
Vendor-neutral tracing API
Implementations available in 6 languages
http://opentracing.io/documentation/
Spring Cloud Sleuth Zipkin
Brings distributed tracing to Spring Cloud
Spring Cloud starter Zipkin (Zipkin + Sleuth)
Supports
Hystrix
Async
RestTemplate
Feign
Zuul
Spring Integration
…
http://tiny.cc/scs-doc
Code walkthrough
https://github.com/nklmish/java-distributed-tracing-demo
https://github.com/nklmish/go-distributed-tracing-demo
Zipkin & Prometheus
Zipkin for…
Summary: latency is never zero, embrace it
Summary
Distributed systems are hard to reason about, with complex call graphs
Distributed tracing helps analyse E2E latency & understand call graphs
Instrumentation is tricky (async, thread pool, callbacks, etc.)
OpenZipkin provides:
Open-source tracing system
Visualises request flow
Spring Cloud Sleuth brings tracing to the Spring world
OpenTracing - aims to standardise tracing
Thank You
Questions?
Slides, review & source code:
http://tiny.cc/tracing
http://tiny.cc/tracing-slides