microservices application tracing standards and simulators - adrians at oscon

Microservices Application Tracing Standards and Simulators From Zipkin to Greater Tracing: Involving a wider group of people in distributed tracing @adrianfcole @adrianco #oscon

Upload: adrian-cockcroft

Post on 16-Apr-2017


TRANSCRIPT

Microservices Application Tracing Standards and Simulators

From Zipkin to Greater Tracing: Involving a wider group of people in distributed tracing

@adrianfcole @adrianco #oscon

Introduction

introduction

opening zipkin

beyond zipkin

simulation

@adrianco @adrianfcole

@adrianfcole
• spring cloud at pivotal
• focused on distributed tracing
• helped open zipkin

Opening Zipkin


Distributed Tracing commoditizes knowledge

Distributed tracing systems collect end-to-end latency graphs (traces) in near real-time.

You can compare traces to understand why certain requests take longer than others.

Zipkin is like Chrome DevTools' network panel, except you see your whole architecture

http://zipkin.io/

It started with community focus

commit 92c941890c2009a401b777093342dc4f28955640
Author: Johan Oskarsson <[email protected]>
Date: Tue Nov 15 10:09:47 2011 -0800

    [split] Enable B3 tracing for TFE. Filter out finagle-http headers from incoming requests

BigBrotherBird is silently born

Zipkin is less silently born

commit 2b7acead044e71c744f39804abe564383eb5f846
Author: Johan Oskarsson <[email protected]>
Date: Wed Jun 6 11:28:34 2012 -0700

    Initial commit

zipkin says “we are a community”

(open)zipkin left the nest

So what happened?

Zipkin development at Twitter was in short bursts, centered on other work

Many experienced Zipkin engineers don’t work at Twitter (or in the Bay Area)

Platform diversity is a reality for many

Having the same goals was our opportunity

How’s OpenZipkin doing now?

Zipkin's now releasable (maybe too releasable)

We’re working on understandability and usability

We’re making the community easier to find

We hit bumps, and sometimes reverse changes

Beyond Zipkin


The “greater” tracing

Many groups are solving similar problems

Some focus on stacks, others on instrumentation

By collaborating more, we can make tracing greater

Instrumentation portability

Interop through shared trace pipelines.

Practical matters, like categorization and tactical designs

Moving R&D to implementation

Simulation and system testing

distributed-tracing google group

Distributed Tracing Workgroup

OpenTracing is an effort to clean-up and de-risk distributed tracing instrumentation

OpenTracing Interfaces decouple instrumentation from vendor-specific dependencies and terminology. This allows applications to switch products with less effort.

http://opentracing.io/

OpenTracing: Go, Python, Java, JavaScript..

A single configuration change to bind a Tracer implementation in main() or similar

import opentracing "github.com/opentracing/opentracing-go"
import "github.com/tracer_x/tracerimpl"

func main() {
    // Bind tracerimpl to the opentracing system
    opentracing.InitGlobalTracer(
        tracerimpl.New(kTracerImplAccessToken))

    ... normal main() stuff ...
}

How does it work?

Clean, vendor-neutral instrumentation code that naturally tells the story of a distributed operation

import opentracing "github.com/opentracing/opentracing-go"

func AddContact(c *Contact) {
    sp := opentracing.StartSpan("AddContact")
    defer sp.Finish()
    sp.LogEventWithPayload("Added contact: ", *c)
    subRoutine(sp, ...)
    ...
}

func subRoutine(parentSpan opentracing.Span, ...) {
    ...
    sp := opentracing.StartChildSpan(parentSpan, "subRoutine")
    defer sp.Finish()
    sp.Info("deferred work to subroutine")
    ...
}

Thanks, @el_bhs for the slide!
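The decoupling idea can be sketched without any tracing library at all. In the example below, `Tracer`, `Span`, `recordingTracer` and `noopTracer` are hypothetical stand-ins written for this illustration, not the OpenTracing API; the point is only that application code depends on an interface, so the implementation behind it can be swapped in one place.

```go
package main

import "fmt"

// Span and Tracer are hypothetical stand-ins for a vendor-neutral tracing API.
type Span interface {
	Finish()
}

type Tracer interface {
	StartSpan(operation string) Span
}

// recordingTracer is one possible "vendor": it keeps finished span names in memory.
type recordingTracer struct {
	finished []string
}

type recordingSpan struct {
	tracer    *recordingTracer
	operation string
}

func (t *recordingTracer) StartSpan(operation string) Span {
	return &recordingSpan{tracer: t, operation: operation}
}

func (s *recordingSpan) Finish() {
	s.tracer.finished = append(s.tracer.finished, s.operation)
}

// noopTracer is a second "vendor" that discards everything.
type noopTracer struct{}
type noopSpan struct{}

func (noopTracer) StartSpan(string) Span { return noopSpan{} }
func (noopSpan) Finish()                 {}

// AddContact is application code: it only knows about the Tracer interface.
func AddContact(tracer Tracer, name string) {
	sp := tracer.StartSpan("AddContact")
	defer sp.Finish()
	_ = name // real work would happen here
}

func main() {
	rec := &recordingTracer{}
	AddContact(rec, "alice")        // traced by the recording implementation
	AddContact(noopTracer{}, "bob") // same code, tracing switched off
	fmt.Println(rec.finished)
}
```

Passing the tracer as a parameter is a choice made for this sketch; the slide code binds a global tracer once in main() instead.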

Pivot Tracing is applied research from Brown University (the one that brought us X-Trace).

Pivot tracing allows you to dynamically query systems at runtime, grouping on “Baggage” which propagates across service boundaries.

pivottracing.io

Pivot Tracing

Start writing queries including the fancy happened-before join

From incr In DataNodeMetrics.incrBytesRead
Join cl In First(ClientProtocols) On cl -> incr
GroupBy cl.procName
Select cl.procName, SUM(incr.delta)

How does it work?

Services need to be in Java and be able to talk to a provided PubSub broker.

// Add a library
<dependency>
  <groupId>edu.brown.cs.systems</groupId>
  <artifactId>pivottracing-agent</artifactId>
  <version>4.0</version>
</dependency>

// Initialize it on bootstrap
PivotTracing.initialize();

@brownsys_jmace made this!

Simulation


What does @adrianco do?

@adrianco

Technology Due Diligence on Deals

Presentations at Conferences

Presentations at Companies

Technical Advice for Portfolio Companies

Program Committee for Conferences

Networking with Interesting People

Tinkering with Technologies

Maintain Relationship with Cloud Vendors

Previously: Netflix, eBay, Sun Microsystems, CCL, TCU London BSc Applied Physics

@adrianco

Testing Flow Monitors

Monitoring tools often “explode on impact” with real world use cases at scale

Interestingly large complex environments are expensive to create or hard to get access to

Free, open source tools don’t have a budget…

OSS Microservice Simulator

Model and visualize microservices
Simulate interesting architectures
Generate large scale configurations
Stress test real tools like Zipkin

Code: github.com/adrianco/spigo
Simulate Protocol Interactions in Go
Visualize with D3, Neo4j or Guesstimate
See for yourself: http://simianviz.surge.sh
Follow @simianviz for updates

ELB: Load Balancer
Zuul: API Proxy
Karyon: Business Logic
Staash: Data Access Layer
Priam: Cassandra Datastore
Three Availability Zones
Denominator: DNS Endpoint

POST Spigo flows to zipkin

# collect flows, duration 2 seconds, architecture lamp
$ ./spigo -c -d 2 -a lamp
[snip]

# clean out zipkin database and post newly created data
$ misc/zipkin.sh lamp

Spigo Nanoservice Structure

func Start(listener chan gotocol.Message) {
    ...
    for {
        select {
        case msg := <-listener:
            flow.Instrument(msg, name, hist)

            switch msg.Imposition {
            case gotocol.Hello: // get named by parent
                ...
            case gotocol.NameDrop: // someone new to talk to
                ...
            case gotocol.Put: // upstream request handler
                ...
                outmsg := gotocol.Message{gotocol.Replicate, listener, time.Now(), msg.Ctx.NewParent(), msg.Intention}
                flow.AnnotateSend(outmsg, name)
                outmsg.GoSend(replicas)
            }
        case <-eurekaTicker.C: // poll the service registry
            ...
        }
    }
}

Skeleton code for sideways replicating a Put message

Instrument incoming requests

Instrument outgoing requests

update trace context
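For readers who want something that compiles, the skeleton can be reduced to a small runnable loop. The sketch below keeps only the shape (a goroutine reading a channel, a switch on message kind, a ticker case) and replaces spigo's gotocol and flow packages with a hypothetical `Message` type and `Serve` function, so none of these names should be read as spigo's API.

```go
package main

import (
	"fmt"
	"time"
)

// Message is a hypothetical, cut-down message; spigo's gotocol.Message carries
// more fields, including a response channel and a trace context.
type Message struct {
	Kind string
	Body string
}

// Serve handles messages until it receives "Goodbye". Each Put is forwarded
// sideways to every replica as a Replicate message. It returns the number of
// Put messages handled.
func Serve(listener <-chan Message, replicas []chan Message, tick <-chan time.Time) int {
	puts := 0
	for {
		select {
		case msg := <-listener:
			switch msg.Kind {
			case "Put": // upstream request handler
				puts++
				for _, r := range replicas {
					r <- Message{Kind: "Replicate", Body: msg.Body}
				}
			case "Goodbye":
				return puts
			}
		case <-tick:
			// periodic housekeeping, such as polling a registry, would go here
		}
	}
}

func main() {
	listener := make(chan Message)
	replica := make(chan Message, 8) // buffered so this sketch never blocks on send
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()

	result := make(chan int)
	go func() { result <- Serve(listener, []chan Message{replica}, ticker.C) }()

	listener <- Message{Kind: "Put", Body: "k=v"}
	listener <- Message{Kind: "Goodbye"}
	fmt.Println("puts handled:", <-result)
	fmt.Println("replicated:", (<-replica).Body)
}
```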

Flow Trace Records

[Diagram: staash in us-east-1 zoneC and six riak nodes across us-east-1 and us-west-2 (zones A, B and C), with Put and Replicate flows labeled s896, s908, s909, s910, s912 and s913]

us-east-1.zoneC.riak2   t98p895s896  Put
us-east-1.zoneA.riak3   t98p896s908  Replicate
us-east-1.zoneB.riak4   t98p896s909  Replicate
us-west-2.zoneA.riak9   t98p896s910  Replicate
us-west-2.zoneB.riak10  t98p910s912  Replicate
us-west-2.zoneC.riak8   t98p910s913  Replicate

context: transaction, parent, span
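The identifiers in the records above read as transaction, parent and span ids (for example t98p896s908). A minimal sketch of how such a context could be derived for each outgoing request follows; the `Context` and `IDs` types here are written for illustration and are assumptions, not spigo's actual gotocol code.

```go
package main

import "fmt"

// Context is a hypothetical trace context: one transaction (trace) id, the id
// of the span that caused this one, and this span's own id.
type Context struct {
	Trace, Parent, Span uint32
}

// IDs hands out span ids. A real system would need this to be concurrency-safe.
type IDs struct{ next uint32 }

func (g *IDs) Next() uint32 {
	g.next++
	return g.next
}

// NewParent returns the context for an outgoing request: same transaction,
// the current span becomes the parent, and a fresh span id is allocated.
func (c Context) NewParent(g *IDs) Context {
	return Context{Trace: c.Trace, Parent: c.Span, Span: g.Next()}
}

// String renders the context in the t<trace>p<parent>s<span> form shown above.
func (c Context) String() string {
	return fmt.Sprintf("t%dp%ds%d", c.Trace, c.Parent, c.Span)
}

func main() {
	ids := &IDs{next: 907} // chosen so the output lines up with the slide
	put := Context{Trace: 98, Parent: 895, Span: 896}
	fmt.Println(put)                // t98p895s896
	fmt.Println(put.NewParent(ids)) // t98p896s908
	fmt.Println(put.NewParent(ids)) // t98p896s909
}
```

Each Replicate shares the Put's transaction id and names the Put's span as its parent, which is what lets a tool such as Zipkin rebuild the tree.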

Zipkin Trace Dependencies


Trace for one Spigo Flow

Definition of an architecture

{
  "arch": "lamp",
  "description": "Simple LAMP stack",
  "version": "arch-0.0",
  "victim": "webserver",
  "services": [
    { "name": "rds-mysql", "package": "store", "count": 2, "regions": 1, "dependencies": [] },
    { "name": "memcache", "package": "store", "count": 1, "regions": 1, "dependencies": [] },
    { "name": "webserver", "package": "monolith", "count": 18, "regions": 1, "dependencies": ["memcache", "rds-mysql"] },
    { "name": "webserver-elb", "package": "elb", "count": 0, "regions": 1, "dependencies": ["webserver"] },
    { "name": "www", "package": "denominator", "count": 0, "regions": 0, "dependencies": ["webserver-elb"] }
  ]
}

Header includes chaos monkey victim

New tier name

Tier package

0 = non Regional

Node count

List of tier dependencies
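An architecture file in this shape is straightforward to load with Go's standard library. The sketch below mirrors the JSON field names from the slide; the `Architecture` and `Service` struct names and the `Parse` and `DeclaredNodes` helpers are assumptions for this example rather than spigo's own types, and it simply sums the count fields without modeling how spigo treats a count of 0.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Service mirrors one entry of the "services" array.
type Service struct {
	Name         string   `json:"name"`
	Package      string   `json:"package"`
	Count        int      `json:"count"`
	Regions      int      `json:"regions"`
	Dependencies []string `json:"dependencies"`
}

// Architecture mirrors the top-level object, including the chaos monkey victim.
type Architecture struct {
	Arch        string    `json:"arch"`
	Description string    `json:"description"`
	Version     string    `json:"version"`
	Victim      string    `json:"victim"`
	Services    []Service `json:"services"`
}

// Parse decodes an architecture definition.
func Parse(data []byte) (*Architecture, error) {
	var a Architecture
	if err := json.Unmarshal(data, &a); err != nil {
		return nil, err
	}
	return &a, nil
}

// DeclaredNodes sums the count field over all tiers.
func (a *Architecture) DeclaredNodes() int {
	total := 0
	for _, s := range a.Services {
		total += s.Count
	}
	return total
}

const lamp = `{
  "arch": "lamp", "description": "Simple LAMP stack",
  "version": "arch-0.0", "victim": "webserver",
  "services": [
    {"name": "rds-mysql", "package": "store", "count": 2, "regions": 1, "dependencies": []},
    {"name": "memcache", "package": "store", "count": 1, "regions": 1, "dependencies": []},
    {"name": "webserver", "package": "monolith", "count": 18, "regions": 1, "dependencies": ["memcache", "rds-mysql"]},
    {"name": "webserver-elb", "package": "elb", "count": 0, "regions": 1, "dependencies": ["webserver"]},
    {"name": "www", "package": "denominator", "count": 0, "regions": 0, "dependencies": ["webserver-elb"]}
  ]
}`

func main() {
	arch, err := Parse([]byte(lamp))
	if err != nil {
		panic(err)
	}
	fmt.Println(arch.Arch, "victim:", arch.Victim, "declared nodes:", arch.DeclaredNodes())
}
```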

See for yourself: http://simianviz.surge.sh/lamp

Migrating to Microservices

See for yourself: http://simianviz.surge.sh/migration

Endpoint

ELB

PHP

MySQL

MySQL

Next step

Controls node placement distance

Select models

Running Spigo

$ ./spigo -a lamp -d 2 -j -c
2016/05/16 18:46:37 Loading architecture from json_arch/lamp_arch.json
2016/05/16 18:46:37 lamp.edda: starting
2016/05/16 18:46:37 HTTP metrics now available at localhost:8123/debug/vars
2016/05/16 18:46:37 Architecture: lamp Simple LAMP stack
2016/05/16 18:46:37 architecture: scaling to 100%
2016/05/16 18:46:37 Starting: {rds-mysql store 1 2 []}
2016/05/16 18:46:37 lamp.us-east-1.zoneB..eureka01...eureka.eureka: starting
2016/05/16 18:46:37 lamp.us-east-1.zoneC..eureka02...eureka.eureka: starting
2016/05/16 18:46:37 lamp.us-east-1.zoneA..eureka00...eureka.eureka: starting
2016/05/16 18:46:37 Starting: {memcache store 1 1 []}
2016/05/16 18:46:37 Starting: {webserver monolith 1 18 [memcache rds-mysql]}
2016/05/16 18:46:37 Starting: {webserver-elb elb 1 0 [webserver]}
2016/05/16 18:46:37 Starting: {www denominator 0 0 [webserver-elb]}
2016/05/16 18:46:37 lamp.*.*..www00...www.denominator activity rate 10ms
2016/05/16 18:46:38 chaosmonkey delete: lamp.us-east-1.zoneA..webserver09...webserver.monolith
2016/05/16 18:46:39 asgard: Shutdown
2016/05/16 18:46:39 Saving 30 histograms for Guesstimate
2016/05/16 18:46:39 lamp.us-east-1.zoneA..eureka00...eureka.eureka: closing
2016/05/16 18:46:39 lamp.us-east-1.zoneC..eureka02...eureka.eureka: closing
2016/05/16 18:46:39 lamp.us-east-1.zoneB..eureka01...eureka.eureka: closing
2016/05/16 18:46:39 spigo: complete
2016/05/16 18:46:39 lamp.edda: closing
2016/05/16 18:46:39 Flushing flows to json_metrics/lamp_flow.json

-a architecture lamp
-d run for 2 seconds
-j graph json/lamp.json
-c flows json_metrics/lamp_flow.json

Riak IoT Architecture

{
  "arch": "riak",
  "description": "Riak IoT ingestion example for the RICON 2015 presentation",
  "version": "arch-0.0",
  "victim": "",
  "services": [
    { "name": "riakTS", "package": "riak", "count": 6, "regions": 1, "dependencies": ["riakTS", "eureka"]},
    { "name": "ingester", "package": "staash", "count": 6, "regions": 1, "dependencies": ["riakTS"]},
    { "name": "ingestMQ", "package": "karyon", "count": 3, "regions": 1, "dependencies": ["ingester"]},
    { "name": "riakKV", "package": "riak", "count": 3, "regions": 1, "dependencies": ["riakKV"]},
    { "name": "enricher", "package": "staash", "count": 6, "regions": 1, "dependencies": ["riakKV", "ingestMQ"]},
    { "name": "enrichMQ", "package": "karyon", "count": 3, "regions": 1, "dependencies": ["enricher"]},
    { "name": "analytics", "package": "karyon", "count": 6, "regions": 1, "dependencies": ["ingester"]},
    { "name": "analytics-elb", "package": "elb", "count": 0, "regions": 1, "dependencies": ["analytics"]},
    { "name": "analytics-api", "package": "denominator", "count": 0, "regions": 0, "dependencies": ["analytics-elb"]},
    { "name": "normalization", "package": "karyon", "count": 6, "regions": 1, "dependencies": ["enrichMQ"]},
    { "name": "iot-elb", "package": "elb", "count": 0, "regions": 1, "dependencies": ["normalization"]},
    { "name": "iot-api", "package": "denominator", "count": 0, "regions": 0, "dependencies": ["iot-elb"]},
    { "name": "stream", "package": "karyon", "count": 6, "regions": 1, "dependencies": ["ingestMQ"]},
    { "name": "stream-elb", "package": "elb", "count": 0, "regions": 1, "dependencies": ["stream"]},
    { "name": "stream-api", "package": "denominator", "count": 0, "regions": 0, "dependencies": ["stream-elb"]}
  ]
}

New tier name

Tier package

Node count

List of tier dependencies

0 = non Regional


Single Region Riak IoT

IoT Ingestion Endpoint

Stream Endpoint

Analytics Endpoint

Load Balancer

Normalization Services

Enrich Message Queue Riak KV

Enricher Services

Ingest Message Queue

Load Balancer

Load Balancer

Stream Service Riak TS

Analytics Service

Ingester Service

See for yourself: http://simianviz.surge.sh/riak

Two Region Riak IoT

IoT Ingestion Endpoint

Stream Endpoint

Analytics Endpoint

East Region Ingestion

West Region Ingestion

Multi Region TS Analytics

See for yourself: http://simianviz.surge.sh/riak

Spigo with Neo4j

$ ./spigo -a netflix -d 2 -n -c -kv chat:200ms
2016/05/18 12:07:08 Graph will be written to Neo4j via NEO4JURL=localhost:7474
2016/05/18 12:07:08 Loading architecture from json_arch/netflix_arch.json
2016/05/18 12:07:08 HTTP metrics now available at localhost:8123/debug/vars
2016/05/18 12:07:08 netflix.edda: starting
2016/05/18 12:07:08 Architecture: netflix A simplified Netflix service. See http://netflix.github.io/ to decode the package names
2016/05/18 12:07:08 architecture: scaling to 100%
2016/05/18 12:07:08 Starting: {cassSubscriber priamCassandra 1 6 [cassSubscriber eureka]}
2016/05/18 12:07:08 netflix.us-east-1.zoneA..eureka00...eureka.eureka: starting
2016/05/18 12:07:08 netflix.us-east-1.zoneB..eureka01...eureka.eureka: starting
2016/05/18 12:07:08 netflix.us-east-1.zoneC..eureka02...eureka.eureka: starting
2016/05/18 12:07:08 Starting: {evcacheSubscriber store 1 3 []}
2016/05/18 12:07:08 Starting: {subscriber staash 1 3 [cassSubscriber evcacheSubscriber]}
2016/05/18 12:07:08 Starting: {cassPersonalization priamCassandra 1 6 [cassPersonalization eureka]}
2016/05/18 12:07:08 Starting: {personalizationData staash 1 3 [cassPersonalization]}
2016/05/18 12:07:08 Starting: {cassHistory priamCassandra 1 6 [cassHistory eureka]}
2016/05/18 12:07:08 Starting: {historyData staash 1 3 [cassHistory]}
2016/05/18 12:07:08 Starting: {contentMetadataS3 store 1 1 []}
2016/05/18 12:07:08 Starting: {personalize karyon 1 9 [contentMetadataS3 subscriber historyData personalizationData]}
2016/05/18 12:07:08 Starting: {login karyon 1 6 [subscriber]}
2016/05/18 12:07:08 Starting: {home karyon 1 9 [contentMetadataS3 subscriber personalize]}
2016/05/18 12:07:08 Starting: {play karyon 1 9 [contentMetadataS3 historyData subscriber]}
2016/05/18 12:07:08 Starting: {loginpage karyon 1 6 [login]}
2016/05/18 12:07:08 Starting: {homepage karyon 1 9 [home]}
2016/05/18 12:07:08 Starting: {playpage karyon 1 9 [play]}
2016/05/18 12:07:08 Starting: {wwwproxy zuul 1 3 [loginpage homepage playpage]}
2016/05/18 12:07:08 Starting: {apiproxy zuul 1 3 [login home play]}
2016/05/18 12:07:08 Starting: {www-elb elb 1 0 [wwwproxy]}
2016/05/18 12:07:08 Starting: {api-elb elb 1 0 [apiproxy]}
2016/05/18 12:07:08 Starting: {www denominator 0 0 [www-elb]}
2016/05/18 12:07:08 Starting: {api denominator 0 0 [api-elb]}
2016/05/18 12:07:08 netflix.*.*..api00...api.denominator activity rate 200ms
2016/05/18 12:07:09 chaosmonkey delete: netflix.us-east-1.zoneA..homepage03...homepage.karyon
2016/05/18 12:07:10 asgard: Shutdown
2016/05/18 12:07:10 Saving 108 histograms for Guesstimate
2016/05/18 12:07:10 Saving 108 histograms for Guesstimate
2016/05/18 12:07:10 netflix.us-east-1.zoneC..eureka02...eureka.eureka: closing
2016/05/18 12:07:10 netflix.us-east-1.zoneA..eureka00...eureka.eureka: closing
2016/05/18 12:07:10 netflix.us-east-1.zoneB..eureka01...eureka.eureka: closing
2016/05/18 12:07:10 spigo: complete
2016/05/18 12:07:11 netflix.edda: closing

-a architecture netflix
-d run for 2 seconds
-n graph and flows written to Neo4j
-c flows json_metrics/netflix_flow.json
-kv chat:200ms start flows at 5/sec

Neo4j Visualization

Neo4j Trace Flow Queries

@adrianco

Conclusion

Monitoring tools can be stressed at scale by simulating their inputs with metrics, dependency graphs and flows

Spigo can be extended to produce any format output at very large scale from a laptop

Ask Adrians

@adrianco @adrianfcole

distributed-tracing google group

opentracing.io zipkin.io

github.com/adrianco/spigo

@opentracing @simianviz @zipkinproject

pivottracing.io