![Page 1: Monitoring to the Nth tier: The state of distributed tracing in 2016](https://reader031.vdocuments.site/reader031/viewer/2022030316/587604f31a28ab4a508b665f/html5/thumbnails/1.jpg)
Monitoring to the Nth (tier)...or, State of Distributed Tracing 2016
Dan Kuebrich, CTO, AppNeta (@dkuebric)
![Page 2: Monitoring to the Nth tier: The state of distributed tracing in 2016](https://reader031.vdocuments.site/reader031/viewer/2022030316/587604f31a28ab4a508b665f/html5/thumbnails/2.jpg)
Outline
● What is distributed tracing?
● Who’s doing it, and how?
● Challenges, and future directions?
![Page 3: Monitoring to the Nth tier: The state of distributed tracing in 2016](https://reader031.vdocuments.site/reader031/viewer/2022030316/587604f31a28ab4a508b665f/html5/thumbnails/3.jpg)
![Page 4: Monitoring to the Nth tier: The state of distributed tracing in 2016](https://reader031.vdocuments.site/reader031/viewer/2022030316/587604f31a28ab4a508b665f/html5/thumbnails/4.jpg)
Thrift Shop
● Frontend web app: PHP
● Text search: Lucene-based, via Thrift
● Pricing service: Erlang, via Thrift
● Content provider search: Ruby, via Thrift
● Spelling corrector: Python bindings around Xapian, via Thrift
● ...
![Page 5: Monitoring to the Nth tier: The state of distributed tracing in 2016](https://reader031.vdocuments.site/reader031/viewer/2022030316/587604f31a28ab4a508b665f/html5/thumbnails/5.jpg)
[Architecture diagram. Components shown: perlbal front ends (fw1, fw2); Apache/PHP app servers (app1, ...); memcached caches; Lucene search nodes; API search (Ruby); pricing (Erlang); spelling (Python); external APIs; MySQL databases (db1, db2)]
![Page 6: Monitoring to the Nth tier: The state of distributed tracing in 2016](https://reader031.vdocuments.site/reader031/viewer/2022030316/587604f31a28ab4a508b665f/html5/thumbnails/6.jpg)
Q: Why do you remember this so well?
![Page 7: Monitoring to the Nth tier: The state of distributed tracing in 2016](https://reader031.vdocuments.site/reader031/viewer/2022030316/587604f31a28ab4a508b665f/html5/thumbnails/7.jpg)
A: ops
![Page 8: Monitoring to the Nth tier: The state of distributed tracing in 2016](https://reader031.vdocuments.site/reader031/viewer/2022030316/587604f31a28ab4a508b665f/html5/thumbnails/8.jpg)
“Close enough” architectural diagram
https://www.flickr.com/photos/clonedmilkmen/3604999084
![Page 9: Monitoring to the Nth tier: The state of distributed tracing in 2016](https://reader031.vdocuments.site/reader031/viewer/2022030316/587604f31a28ab4a508b665f/html5/thumbnails/9.jpg)
Things we had
● Ganglia
● Nagios
● Thrift
○ Per-service status page
○ Service status page
● Logs
![Page 10: Monitoring to the Nth tier: The state of distributed tracing in 2016](https://reader031.vdocuments.site/reader031/viewer/2022030316/587604f31a28ab4a508b665f/html5/thumbnails/10.jpg)
Sample debug workflow
1. Hit refresh N times -- how many times were problematic?
2. Are any services outright down?
3. Systematically tail the logs of every service on every machine
4. Check MySQL running processes
5. SSH in and poke around
6. Deploy debug logging
7. Pray
![Page 11: Monitoring to the Nth tier: The state of distributed tracing in 2016](https://reader031.vdocuments.site/reader031/viewer/2022030316/587604f31a28ab4a508b665f/html5/thumbnails/11.jpg)
X-Trace
![Page 12: Monitoring to the Nth tier: The state of distributed tracing in 2016](https://reader031.vdocuments.site/reader031/viewer/2022030316/587604f31a28ab4a508b665f/html5/thumbnails/12.jpg)
Instrumentation points and request flow
[Diagram. Components shown: load balancer, three web server/application pairs, database, service, cache, 3rd party API]
![Page 13: Monitoring to the Nth tier: The state of distributed tracing in 2016](https://reader031.vdocuments.site/reader031/viewer/2022030316/587604f31a28ab4a508b665f/html5/thumbnails/13.jpg)
![Page 14: Monitoring to the Nth tier: The state of distributed tracing in 2016](https://reader031.vdocuments.site/reader031/viewer/2022030316/587604f31a28ab4a508b665f/html5/thumbnails/14.jpg)
Spans
![Page 15: Monitoring to the Nth tier: The state of distributed tracing in 2016](https://reader031.vdocuments.site/reader031/viewer/2022030316/587604f31a28ab4a508b665f/html5/thumbnails/15.jpg)
Spans
![Page 16: Monitoring to the Nth tier: The state of distributed tracing in 2016](https://reader031.vdocuments.site/reader031/viewer/2022030316/587604f31a28ab4a508b665f/html5/thumbnails/16.jpg)
Great minds…
Distributed tracing based on ID propagation
● Google Dapper (200x? Published paper 2010)
● Twitter Zipkin (Open-sourced 2012)
● Etsy (2014ish)
● Others
Commercial APM -- some distributed tracing
● New Relic
● AppDynamics
● DynaTrace
![Page 17: Monitoring to the Nth tier: The state of distributed tracing in 2016](https://reader031.vdocuments.site/reader031/viewer/2022030316/587604f31a28ab4a508b665f/html5/thumbnails/17.jpg)
Instrumentation points and request flow
[Diagram. Components shown: load balancer, three web server/application pairs, database, service, cache, 3rd party API]
![Page 18: Monitoring to the Nth tier: The state of distributed tracing in 2016](https://reader031.vdocuments.site/reader031/viewer/2022030316/587604f31a28ab4a508b665f/html5/thumbnails/18.jpg)
Challenges: Instrumentation Points
def interesting_method():
    log_entry(...)
    _do_stuff()
    log_exit(...)
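One common way to avoid hand-writing the entry/exit calls in every method is to wrap the method. The following is a minimal sketch, not the approach of any particular tracing product: `traced`, `EVENTS`, and the bodies of `log_entry` and `log_exit` are hypothetical stand-ins that record to an in-memory list.

```python
import functools
import time

# Hypothetical in-memory sink; a real tracer would ship events to a collector.
EVENTS = []

def log_entry(name):
    EVENTS.append(("entry", name, time.monotonic()))

def log_exit(name, error=None):
    EVENTS.append(("exit", name, time.monotonic(), error))

def traced(func):
    """Record entry and exit around func, including when it raises."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log_entry(func.__name__)
        try:
            result = func(*args, **kwargs)
        except Exception as exc:
            log_exit(func.__name__, error=repr(exc))
            raise
        log_exit(func.__name__)
        return result
    return wrapper

@traced
def interesting_method():
    return sum(range(10))  # stand-in for _do_stuff()
```

The wrapper keeps the instrumentation point in one place, so the exit event is still recorded on the error path, which the inline version above would miss if `_do_stuff()` raised.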
![Page 19: Monitoring to the Nth tier: The state of distributed tracing in 2016](https://reader031.vdocuments.site/reader031/viewer/2022030316/587604f31a28ab4a508b665f/html5/thumbnails/19.jpg)
OpenTracing
● Problematic to tie instrumentation to tracing system
● There is no one system that’s perfect for everyone
● So instrumentation that ties you to a system is bad
  ○ Either have it be automatically injected (industry)
  ○ … or obey a common interface so it’s pluggable
● OpenTracing v1 goal: provide the interface for portable instrumentation
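To illustrate what "obey a common interface so it's pluggable" can mean in practice, here is a small sketch. The `Tracer` interface below is hypothetical and deliberately tiny; it is not the actual OpenTracing API. The point is only that application code calls the interface, and the backend behind it can be swapped.

```python
from abc import ABC, abstractmethod

class Tracer(ABC):
    """Hypothetical minimal interface; NOT the real OpenTracing API."""

    @abstractmethod
    def start_span(self, name): ...

    @abstractmethod
    def finish_span(self, span): ...

class NoopTracer(Tracer):
    """Default backend: instrumentation costs almost nothing."""
    def start_span(self, name):
        return None

    def finish_span(self, span):
        pass

class RecordingTracer(Tracer):
    """Toy backend that remembers the names of finished spans."""
    def __init__(self):
        self.finished = []

    def start_span(self, name):
        return {"name": name}

    def finish_span(self, span):
        self.finished.append(span["name"])

_tracer = NoopTracer()

def set_tracer(tracer):
    """Swap the backend without touching instrumented code."""
    global _tracer
    _tracer = tracer

def lookup_price(item):
    # Instrumented application code only ever sees the interface.
    span = _tracer.start_span("lookup_price")
    try:
        return len(item) * 100  # stand-in for real work
    finally:
        _tracer.finish_span(span)
```

Calling `set_tracer(RecordingTracer())` at startup changes where the data goes; `lookup_price` itself is unchanged.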
![Page 20: Monitoring to the Nth tier: The state of distributed tracing in 2016](https://reader031.vdocuments.site/reader031/viewer/2022030316/587604f31a28ab4a508b665f/html5/thumbnails/20.jpg)
Challenges: Trace ID Propagation
def http_rpc_call():
    log_entry(...)
    _do_get(modified_headers, ...)
    log_exit(...)
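The "modified headers" step can be sketched as a pair of helpers: one that adds the trace ID to outbound headers, and one that reads it from inbound headers on the receiving tier. This is a sketch under assumptions: the header name `X-Trace-Id` and the helper names `inject`, `extract`, and `handle_request` are made up for illustration, and real systems each define their own header format.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name

def inject(trace_id, headers):
    """Return a copy of headers that carries the trace ID downstream."""
    modified = dict(headers)
    modified[TRACE_HEADER] = trace_id
    return modified

def extract(headers):
    """Reuse the caller's trace ID if present, else start a new trace."""
    for key, value in headers.items():
        if key.lower() == TRACE_HEADER.lower():  # header names are case-insensitive
            return value
    return uuid.uuid4().hex

def handle_request(inbound_headers):
    """A tier in the middle: continue the trace on its own outbound call."""
    trace_id = extract(inbound_headers)
    outbound = inject(trace_id, {"Accept": "application/json"})
    return trace_id, outbound
```

Every tier that makes an outbound call has to do the inject step, and every tier that receives one has to do the extract step; a single hop that drops the header breaks the trace.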
![Page 21: Monitoring to the Nth tier: The state of distributed tracing in 2016](https://reader031.vdocuments.site/reader031/viewer/2022030316/587604f31a28ab4a508b665f/html5/thumbnails/21.jpg)
Challenges: Trace ID Propagation
def interesting_method(trace_id):
    log_entry(trace_id, ...)
    _do_stuff()
    log_exit(trace_id, ...)
![Page 22: Monitoring to the Nth tier: The state of distributed tracing in 2016](https://reader031.vdocuments.site/reader031/viewer/2022030316/587604f31a28ab4a508b665f/html5/thumbnails/22.jpg)
Challenges: Extracting Value
![Page 23: Monitoring to the Nth tier: The state of distributed tracing in 2016](https://reader031.vdocuments.site/reader031/viewer/2022030316/587604f31a28ab4a508b665f/html5/thumbnails/23.jpg)
Distributed tracing “only”
● Follow request flow through application
● Understand end-to-end latency
● Associate backend load with frontend requests
● Provide errors with distributed context
But... as long as you’re in there...
● Latency of queries, RPC calls, in each tier
● Slow code
● Cache hit/miss ratio
● Errors and exceptions
● Custom tagging/categorization of data
● ...
Rich data set
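As a rough illustration of how the same trace data can answer the "as long as you're in there" questions, the sketch below aggregates a handful of span records into per-layer latency and a cache hit ratio. The record shape (`trace`, `layer`, `ms`, `cache_hit`) and the sample values are invented for this example.

```python
from collections import defaultdict

# Hypothetical span records; durations are per-span, in milliseconds.
spans = [
    {"trace": "t1", "layer": "php", "ms": 120.0},
    {"trace": "t1", "layer": "memcached", "ms": 2.0, "cache_hit": False},
    {"trace": "t1", "layer": "mysql", "ms": 80.0},
    {"trace": "t2", "layer": "php", "ms": 30.0},
    {"trace": "t2", "layer": "memcached", "ms": 1.0, "cache_hit": True},
]

def latency_by_layer(spans):
    """Total time recorded in each layer across all traces."""
    totals = defaultdict(float)
    for span in spans:
        totals[span["layer"]] += span["ms"]
    return dict(totals)

def cache_hit_ratio(spans):
    """Fraction of cache lookups that hit, or None if there were none."""
    lookups = [s["cache_hit"] for s in spans if "cache_hit" in s]
    return sum(lookups) / len(lookups) if lookups else None
```

Here `latency_by_layer(spans)` groups time by tier and `cache_hit_ratio(spans)` reuses the same records for a different question; neither required extra instrumentation beyond tagging the spans.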
![Page 24: Monitoring to the Nth tier: The state of distributed tracing in 2016](https://reader031.vdocuments.site/reader031/viewer/2022030316/587604f31a28ab4a508b665f/html5/thumbnails/24.jpg)
![Page 25: Monitoring to the Nth tier: The state of distributed tracing in 2016](https://reader031.vdocuments.site/reader031/viewer/2022030316/587604f31a28ab4a508b665f/html5/thumbnails/25.jpg)
![Page 26: Monitoring to the Nth tier: The state of distributed tracing in 2016](https://reader031.vdocuments.site/reader031/viewer/2022030316/587604f31a28ab4a508b665f/html5/thumbnails/26.jpg)
![Page 27: Monitoring to the Nth tier: The state of distributed tracing in 2016](https://reader031.vdocuments.site/reader031/viewer/2022030316/587604f31a28ab4a508b665f/html5/thumbnails/27.jpg)
![Page 28: Monitoring to the Nth tier: The state of distributed tracing in 2016](https://reader031.vdocuments.site/reader031/viewer/2022030316/587604f31a28ab4a508b665f/html5/thumbnails/28.jpg)
![Page 29: Monitoring to the Nth tier: The state of distributed tracing in 2016](https://reader031.vdocuments.site/reader031/viewer/2022030316/587604f31a28ab4a508b665f/html5/thumbnails/29.jpg)
Context propagation: beyond performance
● Baggage
● Deadlines
● Auth/load attribution
● Flow control?
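Two of these ideas, baggage and deadlines, can be sketched as extra fields riding along with the trace context. Everything here is an assumption for illustration: the `Context` class, the `DeadlineExceeded` exception, and `call_downstream` are hypothetical names, not the API of any tracing system.

```python
import time

class DeadlineExceeded(Exception):
    pass

class Context:
    """Hypothetical request context carried alongside the trace ID."""

    def __init__(self, deadline, baggage=None):
        self.deadline = deadline            # absolute time.monotonic() value
        self.baggage = dict(baggage or {})  # arbitrary key/value pairs

    def remaining(self):
        """Seconds left before the request should be abandoned."""
        return self.deadline - time.monotonic()

    def with_baggage(self, key, value):
        """Return a new context; the original is left untouched."""
        merged = dict(self.baggage)
        merged[key] = value
        return Context(self.deadline, merged)

def call_downstream(ctx, name):
    # Refuse to start work the caller has already given up on.
    if ctx.remaining() <= 0:
        raise DeadlineExceeded(name)
    # Baggage lets a deep tier attribute load to, say, a tenant.
    return {"service": name, "tenant": ctx.baggage.get("tenant")}
```

Within one process an absolute monotonic deadline is enough. Across machines a sketch like this would more plausibly send the remaining time budget, since monotonic clocks are not comparable between hosts.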
![Page 30: Monitoring to the Nth tier: The state of distributed tracing in 2016](https://reader031.vdocuments.site/reader031/viewer/2022030316/587604f31a28ab4a508b665f/html5/thumbnails/30.jpg)
OFFICE HOURS: 3pm
MORE INFO: Booth #713 & back of the room
@dkuebric
Thanks!