Prometheus – a next-gen Monitoring System
TRANSCRIPT
![Page 1: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/1.jpg)
Prometheus
A next-generation monitoring system
Fabian Reinartz – Production Engineer, SoundCloud Ltd.
![Page 2: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/2.jpg)
Monitoring at SC 2012 – from monolith ...
![Page 3: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/3.jpg)
... to micro services
![Page 4: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/4.jpg)
Monitoring at SC 2012
Service A
Service B
Service C
StatsD Graphite
![Page 5: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/5.jpg)
History – monitoring at SoundCloud 2012
Source: http://eugenedvorkin.com/seven-micro-services-architecture-problems-and-solutions/
![Page 6: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/6.jpg)
History – monitoring at SoundCloud 2012
Source: http://blog.sflow.com/2011/12/using-ganglia-to-monitor-java-virtual.html
![Page 7: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/7.jpg)
History – monitoring at SoundCloud 2012
Source: http://www.bellarmine.edu/faculty/amahmood/tier3/monitoring.html
![Page 8: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/8.jpg)
PROMETHEUS
![Page 9: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/9.jpg)
Prometheus
- started by Matt Proud and Julius Volz as an Open Source project
- first commit 24-11-2012
- public announcement in January 2015
- inspired by Borgmon
- not Borgmon
![Page 10: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/10.jpg)
Features – multi-dimensional data model
http_requests_total{instance="web-1", path="/index", status="401", method="GET"}
#metrics x #labels x #values ▶ millions of time series
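A minimal sketch of why the series count multiplies: each distinct combination of metric name and label values is its own time series. The `seriesKey` helper and the cardinality numbers are hypothetical, chosen only for illustration; this is not how Prometheus stores series internally.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// seriesKey builds a canonical identity for a time series from its metric
// name and label set. Label names are sorted so that map iteration order
// does not change the key.
func seriesKey(metric string, labels map[string]string) string {
	names := make([]string, 0, len(labels))
	for name := range labels {
		names = append(names, name)
	}
	sort.Strings(names)

	pairs := make([]string, 0, len(names))
	for _, name := range names {
		pairs = append(pairs, fmt.Sprintf("%s=%q", name, labels[name]))
	}
	return metric + "{" + strings.Join(pairs, ",") + "}"
}

func main() {
	key := seriesKey("http_requests_total", map[string]string{
		"instance": "web-1", "path": "/index", "status": "401", "method": "GET",
	})
	fmt.Println(key)

	// Hypothetical label cardinalities, for illustration only.
	instances, paths, statuses, methods := 100, 50, 10, 4
	fmt.Println("upper bound on series for one metric:",
		instances*paths*statuses*methods)
}
```

Changing any single label value, for example `status`, yields a different key and therefore a different series.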
![Page 11: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/11.jpg)
Features – powerful query language
topk(3, sum by(path, method) (
  rate(http_requests_total{status=~"5.."}[5m])
))
histogram_quantile(0.99, sum by(le, path) (
  rate(http_requests_duration_seconds_bucket[5m])
))
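To make the second query less of a black box, here is a rough sketch of how a quantile can be estimated from cumulative histogram buckets: find the bucket containing the target rank and interpolate linearly inside it. The `bucket` type and `quantile` function are illustrative names, and the sketch ignores details such as the `+Inf` bucket, so treat it as an approximation of the idea rather than the PromQL implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// bucket is one cumulative histogram bucket: count is the number of
// observations less than or equal to upperBound (the "le" label).
type bucket struct {
	upperBound float64
	count      float64
}

// quantile estimates the q-quantile (0 <= q <= 1) from cumulative buckets.
// It assumes at least one bucket and that the lowest bucket starts at 0.
func quantile(q float64, buckets []bucket) float64 {
	sort.Slice(buckets, func(i, j int) bool {
		return buckets[i].upperBound < buckets[j].upperBound
	})
	total := buckets[len(buckets)-1].count
	rank := q * total

	// First bucket whose cumulative count reaches the target rank.
	i := sort.Search(len(buckets), func(i int) bool {
		return buckets[i].count >= rank
	})

	lower, below := 0.0, 0.0
	if i > 0 {
		lower = buckets[i-1].upperBound
		below = buckets[i-1].count
	}
	inBucket := buckets[i].count - below
	if inBucket == 0 {
		return buckets[i].upperBound
	}
	// Linear interpolation between the bucket's lower and upper bound.
	return lower + (buckets[i].upperBound-lower)*(rank-below)/inBucket
}

func main() {
	// Hypothetical bucket counts for request durations in seconds.
	buckets := []bucket{{0.1, 50}, {0.5, 90}, {1.0, 100}}
	fmt.Printf("p99 estimate: %.3fs\n", quantile(0.99, buckets))
}
```

The accuracy of the estimate depends on the bucket layout: the finer the buckets around the quantile of interest, the smaller the interpolation error.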
![Page 12: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/12.jpg)
Features – powerful query language
topk(3, sum by(path, method) (
  rate(http_requests_total{status=~"5.."}[5m])
))
{path="/api/comments", method="POST"} 105.4
{path="/api/user/:id", method="GET"} 34.122
{path="/api/comment/:id/edit", method="POST"} 29.31
![Page 13: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/13.jpg)
Features – easy to use, yet scalable
- single static binary, no dependencies
$ go get github.com/prometheus/prometheus/cmd/...
$ prometheus
- local storage
- high-throughput [millions of time series, 380,000 samples/sec]
- efficient compression
![Page 14: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/14.jpg)
![Page 15: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/15.jpg)
Integrations
![Page 16: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/16.jpg)
Instrument – natively
var httpDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: namespace,
		Name:      "http_request_duration_seconds",
		Help:      "A histogram of HTTP request durations.",
		Buckets:   prometheus.ExponentialBuckets(0.0001, 1.5, 25),
	},
	[]string{"path", "method", "status"},
)

func handleAPI(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	// do work
	httpDuration.WithLabelValues(r.URL.Path, r.Method, status).
		Observe(time.Since(start).Seconds())
}
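The bucket layout in the snippet above can be reproduced with the standard library alone. The sketch below mirrors what `ExponentialBuckets(start, factor, count)` is documented to return (`count` upper bounds, each `factor` times the previous one); `exponentialBuckets` and `bucketIndex` are stand-in names written for this example, not the client library's code.

```go
package main

import (
	"fmt"
	"sort"
)

// exponentialBuckets returns count upper bounds: start, start*factor,
// start*factor^2, and so on.
func exponentialBuckets(start, factor float64, count int) []float64 {
	bounds := make([]float64, count)
	for i := range bounds {
		bounds[i] = start
		start *= factor
	}
	return bounds
}

// bucketIndex returns the index of the first bucket whose upper bound is
// >= v, or len(bounds) if v exceeds every bound (the implicit +Inf bucket).
func bucketIndex(bounds []float64, v float64) int {
	return sort.SearchFloat64s(bounds, v)
}

func main() {
	bounds := exponentialBuckets(0.0001, 1.5, 25)
	fmt.Printf("first bound: %gs, last bound: %.3fs\n",
		bounds[0], bounds[len(bounds)-1])

	// A hypothetical 30ms request lands in this bucket.
	i := bucketIndex(bounds, 0.030)
	fmt.Printf("0.030s falls into bucket %d (le=%.4f)\n", i, bounds[i])
}
```

With these parameters the 25 buckets span from 0.1 ms up to roughly 1.7 s, which gives fine resolution for fast requests and coarser resolution for slow ones.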
![Page 17: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/17.jpg)
Features – built-in expression browser
![Page 18: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/18.jpg)
Features – native Grafana support
![Page 19: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/19.jpg)
Features – PromDash
![Page 20: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/20.jpg)
![Page 21: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/21.jpg)
DOES IT SCALE?
![Page 22: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/22.jpg)
Features – federation & sharding
Cluster A Cluster B
Cluster C
service metrics container metrics
![Page 23: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/23.jpg)
![Page 24: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/24.jpg)
SERVICE DISCOVERY
![Page 25: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/25.jpg)
DNS SRV
$ dig +short SRV all.foo-api.srv.int.example.com
0 0 4738 ip-10-22-11-32.int.example.com.
0 0 3433 ip-10-22-11-32.int.example.com.
0 0 5934 ip-10-22-11-34.int.example.com.
0 0 5093 ip-10-22-11-42.int.example.com.
0 0 4589 ip-10-22-11-43.int.example.com.
0 0 9848 ip-10-22-12-11.int.example.com.
[...]
![Page 26: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/26.jpg)
DNS SRV
scrape_configs:
- job_name: "foo-api"
  metrics_path: "/metrics"
  dns_sd_configs:
  - names: ["all.foo-api.srv.int.example.com"]
    refresh_interval: 10s
![Page 27: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/27.jpg)
Fancy SD
- Consul
- Kubernetes
- Zookeeper
- EC2
- Mesos-Marathon
- … any via file-based plugins
Relabel based on SD data.
![Page 28: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/28.jpg)
Relabeling
relabel_configs:
- action: replace
  source_labels: [__address__, __telemetry_port]
  target_label: __address__
  regex: (.+):(.+);(.+)
  replacement: $1:$3
IN
"__address__": "10.44.12.135:25431"
"__telemetry_port": "82432"
"cluster": "AB"
"environment": "production"
OUT
"__address__": "10.44.12.135:82432"
"__telemetry_port": "82432"
"cluster": "AB"
"environment": "production"
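The replace step above can be traced by hand: the source label values are joined with `;`, the regex is matched against the whole joined string, and the replacement is expanded into the target label. A sketch of that logic with Go's `regexp` package follows; `relabelReplace` is an illustrative helper, not the Prometheus implementation, and it assumes `;` as the separator and full-string anchoring.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// relabelReplace applies one "replace" relabeling step to a label set.
// If the regex does not match the joined source values, labels are left
// unchanged.
func relabelReplace(labels map[string]string, sourceLabels []string,
	pattern, replacement, targetLabel string) {

	values := make([]string, len(sourceLabels))
	for i, name := range sourceLabels {
		values[i] = labels[name]
	}
	joined := strings.Join(values, ";")

	// Anchor the pattern so it has to match the whole joined string.
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	match := re.FindStringSubmatchIndex(joined)
	if match == nil {
		return
	}
	labels[targetLabel] = string(re.ExpandString(nil, replacement, joined, match))
}

func main() {
	labels := map[string]string{
		"__address__":      "10.44.12.135:25431",
		"__telemetry_port": "82432",
		"cluster":          "AB",
		"environment":      "production",
	}
	relabelReplace(labels,
		[]string{"__address__", "__telemetry_port"},
		`(.+):(.+);(.+)`, "$1:$3", "__address__")
	fmt.Println(labels["__address__"])
}
```

Here `$1` captures the host, `$2` the original port, and `$3` the telemetry port, so the rewritten address keeps the host and swaps in the telemetry port.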
![Page 29: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/29.jpg)
AWS EC2
scrape_configs:
- job_name: "foo-api"
  metrics_path: "/metrics"
  ec2_sd_configs:
  - region: us-east-1
    refresh_interval: 60s
    port: 80

The following meta labels are available during relabeling:
- __meta_ec2_instance_id: the EC2 instance ID
- __meta_ec2_public_ip: the public IP address of the instance
- __meta_ec2_private_ip: the private IP address of the instance, if present
- __meta_ec2_tag_<tagkey>: each tag value of the instance
![Page 30: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/30.jpg)
AWS EC2 – relabeling
relabel_configs:
- source_labels: [__meta_ec2_tag_Type]
  action: keep
  regex: foo-api
- source_labels: [__meta_ec2_tag_Deployment]
  action: replace
  target_label: deployment
  regex: (.+)
  replacement: $1
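The `keep` action acts as a filter over discovered targets: only targets whose source label fully matches the regex survive. A rough sketch, assuming a single source label and full-string anchoring; `labelSet` and `keep` are names invented for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// labelSet holds the labels of one discovered target.
type labelSet map[string]string

// keep returns only the targets whose value for label fully matches pattern.
// A target without the label has the empty string as its value.
func keep(targets []labelSet, label, pattern string) []labelSet {
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	var kept []labelSet
	for _, t := range targets {
		if re.MatchString(t[label]) {
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	// Hypothetical EC2 instances with different "Type" tags.
	targets := []labelSet{
		{"__meta_ec2_instance_id": "i-1", "__meta_ec2_tag_Type": "foo-api"},
		{"__meta_ec2_instance_id": "i-2", "__meta_ec2_tag_Type": "bar-worker"},
		{"__meta_ec2_instance_id": "i-3"},
	}
	for _, t := range keep(targets, "__meta_ec2_tag_Type", "foo-api") {
		fmt.Println(t["__meta_ec2_instance_id"])
	}
}
```

Because the match is anchored, a tag value such as `foo-api-canary` would not pass the filter unless the regex is widened.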
![Page 31: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/31.jpg)
ALERTMANAGER
![Page 32: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/32.jpg)
Alerting
- no opinions
- directly defined on time series data
- verbose on firing ▶ compact but detailed on notification
![Page 33: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/33.jpg)
Alerting
ALERT HighErrorRate
  IF sum by(job, path)(rate(http_requests_total{status=~"5.."}[5m])) /
     sum by(job, path)(rate(http_requests_total[5m])) * 100 > 1
  FOR 10m
  SUMMARY "high number of 5xx errors"
  DESCRIPTION "{{$labels.job}} has {{$value}}% 5xx errors on {{$labels.path}}"
![Page 34: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/34.jpg)
Alerting
{path="/api/comments", method="POST"} 5.43
{path="/api/user/:id", method="GET"} 1.22
{path="/api/comment/:id/edit", method="POST"} 1.01
![Page 35: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/35.jpg)
Alerting
ALERT HighErrorRate
  IF ... * 100 > 1
  FOR 10m
  WITH { severity = "warning" } …
ALERT HighErrorRate
  IF ... * 100 > 3
  FOR 10m
  WITH { severity = "critical" } …
![Page 36: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/36.jpg)
ALERTMANAGER
alerts
silence
inhibit
group, dedup, route
PagerDuty
Slack
...
![Page 37: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/37.jpg)
Alerting
ALERT DiskWillFillIn4Hours
  IF predict_linear(node_filesystem_free{job='node'}[1h], 4*3600) < 0
  FOR 5m
  SUMMARY "device filling up"
  DESCRIPTION "{{$labels.device}} mounted on {{$labels.mountpoint}} on {{$labels.instance}} will fill up within 4 hours."
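The idea behind this alert is a straight-line fit through the last hour of samples, extrapolated four hours ahead. The sketch below uses ordinary least squares for that; `sample` and `predictLinear` are illustrative names, the data is made up, and the real PromQL function may differ in details such as the reference timestamp.

```go
package main

import "fmt"

// sample is one (timestamp, value) point: seconds and free bytes.
type sample struct {
	t, v float64
}

// predictLinear fits a least-squares line through the samples and returns
// the predicted value horizon seconds after the last sample. It assumes at
// least two samples with distinct timestamps.
func predictLinear(samples []sample, horizon float64) float64 {
	n := float64(len(samples))
	var sumT, sumV, sumTV, sumTT float64
	for _, s := range samples {
		sumT += s.t
		sumV += s.v
		sumTV += s.t * s.v
		sumTT += s.t * s.t
	}
	slope := (n*sumTV - sumT*sumV) / (n*sumTT - sumT*sumT)
	intercept := (sumV - slope*sumT) / n

	last := samples[len(samples)-1].t
	return intercept + slope*(last+horizon)
}

func main() {
	// Hypothetical free-space readings over one hour, steadily shrinking.
	samples := []sample{{0, 1000}, {1800, 700}, {3600, 400}}
	predicted := predictLinear(samples, 4*3600)
	fmt.Printf("predicted free space in 4h: %.0f\n", predicted)
	if predicted < 0 {
		fmt.Println("alert condition holds: disk is on track to fill up")
	}
}
```

Alerting on the trend rather than a fixed threshold is what lets a slowly filling large disk stay quiet while a rapidly filling small one pages early.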
http://www.robustperception.io/reduce-noise-from-disk-space-alerts/
![Page 38: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/38.jpg)
DEMO
![Page 39: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/39.jpg)
Turing complete
http://www.robustperception.io/conways-life-in-prometheus/
![Page 40: Prometheus – a next-gen Monitoring System](https://reader030.vdocuments.site/reader030/viewer/2022013106/58706e101a28ab48378b6f1f/html5/thumbnails/40.jpg)
Recording rules
job:http_requests:rate5m = sum by(job) (rate(http_requests_total[5m]))