distributed tracing in openstack - cern

Post on 22-Apr-2022


Distributed Tracing in OpenStack

Ilya Shakhat
Huawei Technologies, Munich Research Center
27 May 2019


About the presenter

Ilya Shakhat

• One of the developers of Neutron LBaaS 1.0

• Co-author of Stackalytics

• Member of the Scale and Performance team

• Maintainer of the performa/shaker [1] and performa/os-faults [2] tools

• Core reviewer of the osprofiler library

[1] Distributed data-plane testing tool: https://opendev.org/performa/shaker
[2] OpenStack fault-injection library: https://opendev.org/performa/os-faults


What is distributed tracing?

Observability = logs + metrics + tracing

• Request tracking in distributed systems.

• Performance and latency measurement.

• Service dependency analysis.

• System exploration and debugging.

• Root cause analysis.

[1] Maze is generated with http://www.mazegenerator.net/


Trace models

Span model

• Ideal for the synchronous programming model. [1]

• Implementations: Jaeger, Zipkin, OpenTracing, OpenCensus.

Event model

• Designed for the asynchronous programming model and messaging patterns. [2]

• A trace is a DAG (directed acyclic graph).
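
To make the DAG point concrete, here is a minimal sketch of an event-model trace. The event names and the dictionary layout are hypothetical and chosen only for illustration; the point is that one event may have several causes, which a single-parent span tree cannot express.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical events of one request: each event maps to the events that
# caused it. "join" has two causes, so the trace is a DAG rather than a tree.
caused_by = {
    "request_received": set(),
    "query_db": {"request_received"},
    "publish_message": {"request_received"},
    "join": {"query_db", "publish_message"},
}


def causal_order(graph):
    """Return events in an order that respects causality.

    Raises graphlib.CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(graph).static_order())


order = causal_order(caused_by)
```

Any order returned here starts with the request and ends with the join; the two middle events are unordered relative to each other, which is exactly what an asynchronous trace has to capture.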

[1] Google Dapper: https://ai.google/research/pubs/pub36356
[2] Facebook Canopy: https://research.fb.com/publications/canopy-end-to-end-performance-tracing-at-scale/
[3] Diagram source: https://medium.com/opentracing/open-for-event-based-tracing-a326c295f2a2


Tracing in OpenStack

OSProfiler

• Project under the Oslo umbrella.

• Instrumentation library – event collection and storage.

• CLI – event processing and visualization.

• Trace model is event-based, but with events aggregated into spans on the client side.
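
The last bullet, events aggregated into spans, can be sketched as follows. This is not OSProfiler's code: the event fields (`name`, `span_id`, `parent_id`, `timestamp`) and the `-start`/`-stop` naming are simplifying assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Span:
    span_id: str
    parent_id: str
    name: str
    start: float
    stop: float

    @property
    def duration(self):
        return self.stop - self.start


def events_to_spans(events):
    """Pair "<name>-start" and "<name>-stop" events that share a span id."""
    started = {}
    spans = []
    for ev in sorted(events, key=lambda e: e["timestamp"]):
        name, _, phase = ev["name"].rpartition("-")
        if phase == "start":
            started[ev["span_id"]] = (name, ev)
        elif phase == "stop" and ev["span_id"] in started:
            name, begin = started.pop(ev["span_id"])
            spans.append(Span(ev["span_id"], begin["parent_id"], name,
                              begin["timestamp"], ev["timestamp"]))
    return spans


# Hypothetical events: a web request that wraps one DB query.
events = [
    {"name": "wsgi-start", "span_id": "a", "parent_id": "root", "timestamp": 0.0},
    {"name": "db-start", "span_id": "b", "parent_id": "a", "timestamp": 1.0},
    {"name": "db-stop", "span_id": "b", "parent_id": "a", "timestamp": 3.0},
    {"name": "wsgi-stop", "span_id": "a", "parent_id": "root", "timestamp": 5.0},
]
spans = events_to_spans(events)
```

Spans are emitted in the order their stop events arrive, so the inner DB span comes out before the enclosing web request span.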

Tracing is enabled explicitly for each command, e.g.:

openstack --os-profile SECRET_KEY <command>

Trace can be viewed via CLI:

osprofiler trace show <trace-id>


Code instrumentation

Context propagation

At boundaries:

• request is received;

• outgoing connection is made;

• system tool is called;

• DB query is executed.

At branching:

• a new thread is spawned.

Instrumentation code in libraries.
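
A minimal sketch of what such library-level instrumentation has to do, using only the Python standard library. The header name and all function names are illustrative assumptions, not the API of OSProfiler or of any Oslo library.

```python
import contextvars
import threading
import uuid

# Illustrative header name used to carry the trace id between services.
TRACE_HEADER = "X-Trace-Info"
_current = contextvars.ContextVar("trace_context", default=None)


def on_request_received(headers):
    """Inbound boundary: adopt the caller's trace id or start a new trace."""
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    _current.set(trace_id)
    return trace_id


def outgoing_headers():
    """Outbound boundary: attach the current trace id to the next request."""
    trace_id = _current.get()
    return {TRACE_HEADER: trace_id} if trace_id else {}


def spawn_traced(target, *args):
    """Branching: run target in a new thread with a copy of the context."""
    ctx = contextvars.copy_context()
    thread = threading.Thread(target=ctx.run, args=(target, *args))
    thread.start()
    return thread
```

The explicit `copy_context()` matters because a new thread does not necessarily see the context variables of the thread that spawned it; without it the trace would break at the branching point.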

[Diagram: a service receives requests through its REST API and RPC API; its outgoing boundaries are the REST API client, RPC call, RPC cast, system tool, thread spawn and DB.]


Demo setup

• OpenStack Stein installed using PackStack in multi-node mode.

• Additional instrumentation in oslo.service, oslo.concurrency, oslo.privsep and neutronclient.

• OSProfiler with Zipkin driver.

• Spans are collected and processed in Jaeger and stored in Elasticsearch.

[Diagram: OpenStack services (Keystone, Nova, Neutron, Glance) send spans to Jaeger, which stores them in Elasticsearch (ES).]


System exploration

Traces help to understand the code flow in a distributed system.

Figure captions: "Server creation in theory" [1]; "Real view in dynamics".

[1] https://docs.openstack.org/nova/stein/reference/vm-states.html


Performance analysis

Spans are transformed into metrics, giving a view into internal operations such as RPC calls.

Figure captions: a metric is extracted from spans and visualized in Kibana; span durations are visualized in Jaeger, where slower spans appear redder.

Request profiling helps to find bottlenecks and supports critical path analysis.

Outlier – 3 times longer than usual

Neutron operation takes most
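
The outlier annotation on this slide (a span three times longer than usual) can be sketched as a small computation over span durations. Treating the per-operation median as "usual" is an assumption of this sketch, and the operation names are hypothetical.

```python
from collections import defaultdict
from statistics import median


def duration_metrics(spans):
    """Group span durations (milliseconds) by operation name."""
    by_name = defaultdict(list)
    for name, duration in spans:
        by_name[name].append(duration)
    return by_name


def outliers(spans, factor=3.0):
    """Return spans lasting at least `factor` times their operation's median."""
    metrics = duration_metrics(spans)
    return [(n, d) for n, d in spans if d >= factor * median(metrics[n])]


# Hypothetical (operation, duration in ms) pairs extracted from spans.
spans = [
    ("rpc.call", 100), ("rpc.call", 100), ("rpc.call", 100),
    ("rpc.call", 400), ("db.query", 200),
]
slow = outliers(spans)
```

In the demo this kind of aggregation is done by the tracing backend and Kibana rather than by hand; the sketch only shows the arithmetic behind the chart.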


RCA scenario setup

Server creation command:

$ openstack --os-profile SECRET_KEY server create --network private --image cirros --flavor m1.tiny test

The Nova architecture diagram is based on https://docs.openstack.org/nova/stein/user/architecture.html

The command returns once the DB object is created; the VM is spawned in the background. The user has to poll Nova to get the VM status.
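
The polling step might look like the sketch below. The `get_status` callable is a hypothetical stand-in for whatever client call returns the server status; the sketch only assumes that the status stays `BUILD` until the server becomes `ACTIVE` or `ERROR`.

```python
import time


def wait_for_server(get_status, timeout=300.0, interval=2.0, sleep=time.sleep):
    """Poll until the server leaves the BUILD state; return the final status."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status != "BUILD":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("server is still in BUILD state")
        sleep(interval)
```

Because the API response and the actual VM build are decoupled like this, a trace of the request is the natural way to see the part that happens after the response is sent.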


Server creation trace

Response is sent to the user

Execution continues asynchronously


Fault injection

Injected fault: the OVS DB service is down on the compute node.

Note: the Neutron OVS agent is still considered alive (the failure has not been detected yet).

Without tracing, root-cause analysis means:

• grep the logs for the VM id;

• filter messages by request-id;

• jump to the next service along the path;

• repeat until the error is found.

The Nova architecture diagram is based on https://docs.openstack.org/nova/stein/user/architecture.html


Fault observation

VM status is error:

Build of instance ac50cb4a-ad7c-4abb-8bee-d8d025b545a3 aborted: Failed to allocate the network(s), not rescheduling.

Trace overview:


Root-cause analysis

Error in VIF driver

Failed to call OVS utility

Very long operation (timeout)
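
With a trace at hand, the search described by these annotations amounts to following error flags down the span tree. A minimal sketch, assuming a hypothetical nested-dictionary trace in which failed spans carry an `error` flag; the span names are made up to mirror the annotations above.

```python
def error_path(span, path=()):
    """Follow failed spans downwards; return the path to the innermost one."""
    path = path + (span["name"],)
    for child in span.get("children", []):
        found = error_path(child, path)
        if found:
            return found
    return path if span.get("error") else None


# Hypothetical trace of the failed server creation.
trace = {
    "name": "server create",
    "error": True,
    "children": [
        {"name": "image lookup", "children": []},
        {
            "name": "VIF driver",
            "error": True,
            "children": [{"name": "OVS utility call", "error": True}],
        },
    ],
}
root_cause = error_path(trace)
```

The returned path replaces the manual hop from service to service through the logs: it names every failing layer from the user request down to the call that actually failed.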


Trace comparison

Structural changes

The structure of a trace with a fault differs significantly from that of a normal one.

red – missing in trace with fault

green – missing in normal trace
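
One simple way to compute such a red/green diff is to compare the sets of root-to-span name paths of the two traces. This is a sketch under that assumption, not the method used to produce the slide; the trace layout and span names are hypothetical.

```python
def paths(span, prefix=()):
    """Yield the name path from the root to every span in the tree."""
    here = prefix + (span["name"],)
    yield here
    for child in span.get("children", []):
        yield from paths(child, here)


def compare(normal, faulty):
    """Return (missing_in_faulty, missing_in_normal) as sets of paths."""
    a, b = set(paths(normal)), set(paths(faulty))
    return a - b, b - a


# Hypothetical traces: the faulty one replaces a step with error handling.
normal = {"name": "server create", "children": [
    {"name": "plug VIF", "children": [{"name": "port binding"}]}]}
faulty = {"name": "server create", "children": [
    {"name": "plug VIF", "children": [{"name": "error handling"}]}]}

red, green = compare(normal, faulty)
```

Here `red` holds the paths missing in the trace with the fault and `green` the paths missing in the normal trace, matching the legend above.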


How to use tracing now?

DevStack [1]

enable_plugin osprofiler https://opendev.org/openstack/osprofiler master
OSPROFILER_COLLECTOR=redis

The plugin enables tracing in all OpenStack services and Tempest. The default driver is Redis.
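
For a service configured by hand, the equivalent is a `[profiler]` section in its configuration file. The sketch below only renders such a section; the option names `enabled`, `hmac_keys` and `connection_string` are OSProfiler configuration options, while the Redis URL and the helper function are placeholders assumed for this example.

```python
import configparser
import io


def profiler_section(hmac_key, connection_string):
    """Render a [profiler] section as it would appear in a service config."""
    cfg = configparser.ConfigParser()
    cfg["profiler"] = {
        "enabled": "True",
        "hmac_keys": hmac_key,
        "connection_string": connection_string,
    }
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


# Placeholder values: the key must match the one passed via --os-profile.
text = profiler_section("SECRET_KEY", "redis://localhost:6379")
```

The HMAC key is what ties the two ends together: a request is traced only if the key supplied on the command line matches one of the keys configured in the service.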

Zuul Tempest job [2]

Zuul v3 makes it easy to configure a job to run Tempest tests with the tracing switched on.

Rally [3]

Collects traces for all iterations of a Rally scenario.

[1] https://opendev.org/openstack/osprofiler/src/branch/master/devstack
[2] https://opendev.org/openstack/osprofiler/src/tag/2.8.0/.zuul.yaml#L27-L42
[3] https://review.opendev.org/#/c/615350/


Available drivers

Driver                                   Collector [1]   View [2]   Zuul job [3]
Redis                                    ✓               ✓          ✓
SQLAlchemy (SQLite, PostgreSQL, MySQL)   ✓               ✓          ✓
Elasticsearch                            ✓               ✓          ✕
MongoDB                                  ✓               ✓          ✕
Jaeger                                   ✓               ✕          ✕
Oslo.Messaging (deprecated)              ✓               ✕          ✕

[1] Expose trace events from instrumented code
[2] View traces in osprofiler CLI
[3] Integration testing in OpenStack gate


Future Work

OpenTracing compatibility

Why?

• OpenTracing is the de facto standard for distributed tracing.

• OpenTracing is part of the OpenTelemetry initiative [1].

Benefits

• Transparent tracing through OpenStack services.

• Out-of-the-box support for advanced platforms such as CNCF Jaeger.

[1] https://opentelemetry.io/


Thank you.
