Volta: Logging, Metrics and Monitoring as a Service
LN Renganarayana, Technical Director / Architect
Cloud Platform [email protected]
twitter: @lrengan
Jan 7, 2015. Volta / Cloud Platform Engineering, Symantec
Outline
• Motivation: data and events are the foundation of business
• Why build a (new) Service?
• What have we built: a (near) real-time data analytics pipeline
• The journey and lessons learned
• Looking ahead: Volta next gen
Data and events : the foundation
Picture: “Devops with S for sharing”, Patrick Debois
• which features to build?
• what is a good pricing model?
• how fast can I build?
• what is the perf of my code?
• how is the service?
• what is my capacity?
• what is my current usage?
Why build a (new) service?
Why build a service?
Picture: Jim Nisbet & Philip O’Toole AWS re:Invent 2013 Loggly presentation
Single place for events across the stack
Bare Metal
IaaS (OpenStack)
Platform Services: BP, SP, KV, OBS
Symantec Services & Apps
Volta
Identity Manager
CI / CD
Common Services
Volta : Design Goals
• Design for both Developers and Ops
– Make it extremely simple to capture events
– Provide powerful search and visualization tools
• Secure, multi-tenant: we are Symantec, so security comes first
• Scalable: elastically scale with load
• Highly available: Volta is the eyes & ears of Operations
• One system for logs, metrics, monitoring & other events
• Build using open source tools and for open sourcing
What we have built ...
A (near) real-time data analytics pipeline
Volta Client View
[Diagram: App, Platform Services and Infrastructure send events to Volta. Labels: writes app metrics directly; SNMP, Vars and JMX expose metrics; Pull Metrics; Push Metrics; Volta Shipper (VM logs); Volta (metrics, log events); Alerts & Config UI]
Push: StatsD, metrics extension for OpenStack
Pull: CollectD
Shipper: Logstash, moving to Heka
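To make the push path concrete, here is a minimal sketch of an application emitting a metric in the StatsD line format (`<name>:<value>|<type>`) over UDP. The host, port and metric names are placeholders assumed for illustration; the slides do not say where a tenant's StatsD daemon listens.

```python
import socket


def statsd_line(name: str, value: float, metric_type: str) -> str:
    """Format one metric in the StatsD line format: <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}"


def push_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one StatsD line over UDP (fire and forget).

    host and port are assumed placeholders, not Volta endpoints.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

An application would call, for example, `push_metric(statsd_line("myapp.requests", 1, "c"))` for a counter or use type `"g"` for a gauge.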
Volta Under the Hood
[Diagram, components: Client App / Service with log & metrics shipper; Kafka cluster (knode1, knode2, knode3 ... knodeN) carrying log, metric & alert events; Storm cluster (Authentication, Validation, Alerts Processing); Quota & Policy; Redis; ElasticSearch (Log Store); InfluxDB (Metrics Store); alerts by email & callbacks; Load Balancer; Front End Cluster s1 ... sn (multi-tenancy and Kibana, Grafana proxies); Keystone]
The Ingest Pipeline
[Diagram, components: Client App / Service with log & metrics shipper; VIP; Kafka cluster (knode1, knode2, knode3 ... knodeN); log, metric & alert events]
• Kafka: replicated, fault-tolerant, persistent message queue
• LogTopic, MetricTopic, AlertTopic
• Each topic is split into partitions
• Per-topic retention policy
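As a rough illustration of topics and partitions, the sketch below routes an event to one of the three topics named on the slide and picks a partition from a key. The `type` field, the use of `tenant_id` as the partition key and the CRC32 hash are all assumptions made for this example; the slides do not describe how Volta keys its messages.

```python
import zlib

# Topic names come from the slide; the event "type" field is hypothetical.
TOPICS = {"log": "LogTopic", "metric": "MetricTopic", "alert": "AlertTopic"}


def route_event(event: dict, num_partitions: int) -> tuple[str, int]:
    """Return (topic, partition) for an event.

    Hashing tenant_id keeps one tenant's events in one partition; this key
    choice and the CRC32 hash are assumptions, not Volta's documented scheme.
    """
    topic = TOPICS[event["type"]]
    key = event["tenant_id"].encode("utf-8")
    partition = zlib.crc32(key) % num_partitions
    return topic, partition
```

Keying by tenant preserves per-tenant ordering within a partition, at the cost of uneven load when one tenant is much noisier than the others.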
Event processing and storage
[Diagram, components: log, metric & alert events; Storm cluster (Authentication, Validation, Alerts Processing); Quota & Policy; Redis; ElasticSearch (Log Store); InfluxDB (Metrics Store); alerts by email & callbacks]
• Alert rules
• [tenantid, apikey] pairs
• Per-tenant, per-day index
• Index typed fields
• Quota and retention policy
• Tenant-id-prefixed time series names
• Continuous queries do rollups
• Retention policy through rollups
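The per-tenant, per-day index and the tenant-prefixed series names can be sketched as two small naming helpers. The exact layouts used here (`<tenant>-YYYY.MM.DD` and a `.` separator) are assumptions for illustration; the slides state only that the index is per tenant per day and that series names carry a tenant id prefix.

```python
from datetime import date


def log_index_name(tenant_id: str, day: date) -> str:
    """One log index per tenant per day; the name layout is assumed."""
    return f"{tenant_id}-{day:%Y.%m.%d}"


def metric_series_name(tenant_id: str, series: str) -> str:
    """Prefix a time series name with the tenant id; the separator is assumed."""
    return f"{tenant_id}.{series}"
```

Daily indices make retention cheap: dropping a whole index is far less work than deleting individual documents.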
Multi-tenancy Proxy & UI
[Diagram, components: Load Balancer; Front End Cluster s1 ... sn (multi-tenancy and Kibana, Grafana proxies); Keystone; Redis; ElasticSearch (Log Store); InfluxDB (Metrics Store)]
• Intercepts and rewrites queries to ES and InfluxDB
• Enforces multi-tenancy (visibility of events to users)
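One way such a proxy could enforce visibility is to wrap every incoming query in a tenant filter before forwarding it. The sketch below does this for an Elasticsearch query using the `bool` query with a `filter` clause from the current Query DSL; that DSL form, the function name and the `tenant_id` field placement are assumptions, since the slides do not show the actual rewrite rules.

```python
def scope_query(user_query: dict, visible_tenants: list[str]) -> dict:
    """Wrap a user's query so it can only match events of visible tenants.

    visible_tenants would come from the user's Keystone tenant and group
    memberships; how they are resolved is outside this sketch.
    """
    return {
        "query": {
            "bool": {
                "must": [user_query],
                "filter": [{"terms": {"tenant_id": visible_tenants}}],
            }
        }
    }
```

Because the filter is added by the proxy, a dashboard cannot widen its own scope by editing the query it sends.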
Security and Multi-tenancy model
• Authentication with Keystone backed by LDAP
– user authentication for Query API and UI
• Multi tenancy with users and groups
– Events have tenant id and apikey
• Cross tenant correlation
– group membership used for cross-tenant event visibility / correlation
• Dashboard sharing
Retention Policy : Log Events
• ElasticSearch allows powerful querying, but it comes at a cost
– Store only logs that help you operate and troubleshoot better
– Use appropriate debug levels (not INFO)
• Fixed quota: 350 GB or 500 GB
• When a tenant reaches its quota limit, Volta deletes 20% of the old logs to free up space
• Through wise use of the quota you can retain logs for many days
• Volta can retain logs longer for special tenants who need to store them for compliance / audit
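The quota rule can be sketched as a function that, once a tenant is at its quota, selects the oldest daily indices until roughly 20% of the stored volume is freed. Deleting whole indices oldest first, and measuring the 20% against the stored volume, are assumptions of this sketch; the slides say only that 20% of old logs are deleted.

```python
def indices_to_delete(indices: list[tuple[str, float]], quota_gb: float,
                      free_fraction: float = 0.20) -> list[str]:
    """Pick indices to drop for a tenant that has reached its quota.

    indices: (name, size_gb) pairs ordered oldest first (assumed layout).
    Returns an empty list while the tenant is under quota.
    """
    used = sum(size for _, size in indices)
    if used < quota_gb:
        return []
    target = free_fraction * used  # assumed: free 20% of the stored volume
    freed, doomed = 0.0, []
    for name, size in indices:
        if freed >= target:
            break
        doomed.append(name)
        freed += size
    return doomed
```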
Metric Events: Retention Policy and Rollups
Naming scheme:
host + “.” + name + “.” + type_if_avail + “.” + retention_period
Retention period: 1 day, 1 week, 1 month, and 3 months:
Names for the example:
● default 1 day: lmm-dev-bastion.memory.used_
● 1 week: lmm-dev_bastion.memory.used_1w
● 1 month: lmm-dev_bastion.memory.used_1m
● 3 months: lmm-dev_bastion.memory.used_3m
rollup precision:
● default 1 day: user defined (highest)
● 1 week: metrics aggregated to 1 minute
● 1 month: metrics aggregated to 5 minutes
● 3 months: metrics aggregated to 1 hour
Naming scheme & retention policies
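The rollup precisions above can be illustrated with a small aggregation function that buckets raw points into the interval of each retention period. Using the mean as the aggregate is an assumption; the slides give the intervals but not the aggregation function, and in Volta the rollups are done by InfluxDB continuous queries rather than by client code.

```python
from collections import defaultdict
from statistics import mean

# Rollup intervals from the slide, in seconds, keyed by retention suffix.
ROLLUP_SECONDS = {"1w": 60, "1m": 300, "3m": 3600}


def rollup(points: list[tuple[int, float]], retention: str) -> list[tuple[int, float]]:
    """Aggregate (unix_seconds, value) points into one mean value per bucket."""
    width = ROLLUP_SECONDS[retention]
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % width].append(value)  # align to the bucket start
    return [(start, mean(vals)) for start, vals in sorted(buckets.items())]
```

A query over the 3-month series therefore sees at most one point per hour, which is why users need to know which series they are reading.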
{
"@version": "1",
"@timestamp": "2014-08-06T19:17:43.000Z",
"host": "lmm-dev-bastion",
"name": "memory",
"collectd_type": "memory",
"type_instance": "used",
"value": 341884928,
"tenant_id": "db5ca8e4c8514fad9f98dbc4d648ee87",
"apikey": "26d85ae3-1e10-4ce4-837a-7a1c8dfc67fb"
}
Sample for metric from collectd
Alerts : Email and Callbacks
• Alerts can be set using the Alert UI or the REST API
• Alerts can be sent by email or posted to a webhook (REST endpoint)
• Webhooks provide a good mechanism for integration with external automation and UIs
• Alerts on log events
– User specifies an alert template using regular expressions to match
– Can match one or more fields from a log event
– Simple and complex expressions
• Alerts on metric events
– User specifies an alert template using comparison operators
– Can match one or more fields from the metric event
– Simple and complex expressions
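The two kinds of alert template can be sketched as predicates over an event: regular expressions per field for log events, and a comparison operator against a threshold for metric events. The rule shapes and function names below are hypothetical; the slides do not show Volta's template format.

```python
import operator
import re

COMPARATORS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
               "<=": operator.le, "==": operator.eq}


def log_alert_fires(event: dict, patterns: dict[str, str]) -> bool:
    """True when every listed field matches its regular expression."""
    return all(re.search(rx, str(event.get(field, ""))) is not None
               for field, rx in patterns.items())


def metric_alert_fires(event: dict, field: str, op: str, threshold: float) -> bool:
    """True when event[field] <op> threshold holds."""
    return COMPARATORS[op](event[field], threshold)
```

For example, `metric_alert_fires(event, "value", ">", 3e8)` would fire on the collectd memory sample if its value exceeded 300,000,000 bytes.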
Current deployment
• Multiple deployments : on bare KVM nodes, on OpenStack VMs
– On KVM nodes: 40+ VMs, 80+ TB storage, many large memory nodes
– Components are deployed in clustered mode for HA
– Some with active/active replication, some with active/passive
• Used by Platform and Infrastructure Services
– Tens of thousands of events per second (we have seen around 160 K events/sec)
– Hundreds of GBs of data collected and indexed per day
– Queries currently come from Kibana and Grafana; in the future, also from APIs
The Journey and Lessons ...
Log, metrics and alerts
• Log events
– insist on good severity levels
– enforce quotas to induce behavior change
– watch out for large messages (zip lines from stdout/stderr)
• Metric events
– keep users aware of rollups (granularity)
• Alerts
– watch out for overly simple ones: alert floods
– watch out for complex regexes: performance / memory suckers
– encourage metrics-based alerts: this is what scales
Kafka, ES and Storm
• Kafka
– retention policy vs storage space: do the math with ingest & processing rate
– if you are not using auto-rebalance of leaders, keep an eye on the leaders
• Storm
– smaller topologies: easy to update and optimize
– match consumer parallelism (Kafka spouts) to the number of partitions
– tune the number of executor threads for optimal performance
• ElasticSearch
– aggregate your writes
– heap size <= 32 GB, turn off swap
– benefits hugely from high IOPS: use SSDs if you can
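"Doing the math" for Kafka retention can be as simple as multiplying ingest rate, average event size, retention window and replication factor. The sketch below does exactly that; the function name and the example numbers are hypothetical, and the formula ignores compression and index overhead.

```python
def kafka_storage_bytes(events_per_sec: float, avg_event_bytes: float,
                        retention_hours: float, replication_factor: int) -> float:
    """Approximate disk needed to hold one topic for its retention window.

    Ignores compression, index files and headroom, so treat the result
    as a lower-bound estimate rather than a sizing guarantee.
    """
    seconds = retention_hours * 3600
    return events_per_sec * avg_event_bytes * seconds * replication_factor
```

With assumed inputs of 10,000 events/sec, 1,000 bytes per event, 24 hours of retention and 3 replicas, the estimate comes to about 2.6 TB across the cluster.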
Using Open Source Software : Joy and Frustrations
• Be ready for constant upgrades
– for bug fixes
– to get cool new features: Grafana, Kibana
– for stability, cool stats and visualization: Storm
• InfluxDB clustering is still maturing
– temporary HA solution: write to 2+ InfluxDB instances
– waiting for 0.9 release with better clustering
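The temporary HA approach of writing to two or more instances can be sketched as a fan-out over independent writers. The writer callables are hypothetical stand-ins for whatever client each InfluxDB instance is reached through, and the "succeed if at least one write lands" policy is an assumption of this sketch.

```python
from typing import Callable, Sequence


def write_to_all(point: dict, writers: Sequence[Callable[[dict], None]]) -> int:
    """Send one point to every backend; return how many writes succeeded."""
    ok = 0
    for write in writers:
        try:
            write(point)
            ok += 1
        except Exception:
            # A failed replica must not block the others; a real system
            # would log the failure and repair the lagging replica later.
            pass
    if ok == 0:
        raise RuntimeError("point was not stored on any backend")
    return ok
```

The trade-off is that replicas can drift apart after a partial failure, which is one reason this is described as temporary.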
Eat your own Dog Food
• Volta was a cobbler’s child for a while …
– did not use any system to aggregate logs and metrics!
• Now we are using Volta to collect its logs and metrics
– send logs and metrics from one Volta instance to another
– sending to the same instance is an interesting one!
• Important metrics:
– ingest rate, Storm processing rate, ES / Influx Write latency
– end to end latency of events
Synthetic Transactions and Tracking SLAs
• Goal: track Service level metrics
– availability to users / business
– latency for operations to users
• Use Synthetic Transactions that exercise a sequence of APIs
– measure success / failure rates
– measure end to end latency
– collect, trend and alert on these
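A synthetic transaction can be sketched as a runner that executes a sequence of API calls, stops at the first failure, and reports success plus end-to-end latency. The step callables are hypothetical placeholders for real API calls; the result dictionary is the kind of event that could itself be shipped as a metric.

```python
import time
from typing import Callable, Sequence


def run_synthetic_transaction(steps: Sequence[Callable[[], None]]) -> dict:
    """Run steps in order; report success and total latency in seconds."""
    start = time.perf_counter()
    success = True
    for step in steps:
        try:
            step()
        except Exception:
            success = False  # any failing step fails the whole transaction
            break
    return {"success": success, "latency_seconds": time.perf_counter() - start}
```

Running this on a schedule and trending the two fields gives the availability and latency figures the slide asks for.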
Deployment & Ops : automate, automate, automate …
• Volta is a collection of services
– use separate repos, deploy small changes
• Lots of configuration parameters: manage consistency
– performance very sensitive to values
– e.g., heap, number of workers, etc.
• Performance benchmarking
– needs to be done for each environment
• CI and Deployment pipeline
Volta next gen
Volta Next Gen
• Open source Volta
• Refactor Storm
– Split into separate metric and log topologies and batch writes
• Move ES and InfluxDB to higher-IOPS storage (SSDs?)
• Multi-DC support via stream duplication
• Archival into Swift / HDFS
• Anomaly detection using CEP / Storm
• HTTP REST API in front of Kafka
• Deployment automation using OpenStack Murano
Thank you!
Questions, Comments, Suggestions?
We are interested in Open Sourcing & Collaborating on Volta.
Interested?
And, we are hiring …. interested? [email protected]
twitter: @lrengan
Backup Slides
LMM Metrics Data Model
Mandatory fields:
● name : name of the metric. LMM uses this to store the metrics, and you will use it in queries: select “value” from “load”
● value : value of the metric at a given time
● @timestamp : time stamp
● host : host name or any other id
● tenant_id : tenant id (keystone)
● apikey : LMM apikey
Sample for metric from collectd:
{
"@version": "1",
"@timestamp": "2014-07-30T00:16:59.000Z",
"name": "cpu",
"host": "demo.symcpe.net",
"plugin_instance": "0",
"collectd_type": "cpu",
"type_instance": "interrupt",
"value": 0,
"tenant_id": "db5ca8e4c8514fad9f98dbc4d648ee87",
"apikey": "26d85ae3-1e10-4ce4-837a-7a1c8dfc67fb"
}
Collectd : the name of the plugin becomes the name of the metric. E.g.: cpu or memory
StatsD : the user's metric name concatenated with the metric type by a dot. E.g.: myapp.counter or myapp.gauge
Reserved fields: time, sequence_number
Special field: type_instance
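The data model can be checked with a small validator that looks for the mandatory fields and flags the reserved ones, plus a helper for the StatsD naming rule. The function names are hypothetical, and treating a reserved field in a client event as an error is an assumption; the slides list the reserved names without saying how Volta reacts to them.

```python
MANDATORY_FIELDS = ("name", "value", "@timestamp", "host", "tenant_id", "apikey")
RESERVED_FIELDS = ("time", "sequence_number")


def validate_metric(event: dict) -> list[str]:
    """Return a list of problems; an empty list means the event looks valid."""
    problems = [f"missing field: {f}" for f in MANDATORY_FIELDS if f not in event]
    # Assumed policy: clients should not set the reserved fields themselves.
    problems += [f"reserved field used: {f}" for f in RESERVED_FIELDS if f in event]
    return problems


def statsd_metric_name(user_name: str, metric_type: str) -> str:
    """StatsD rule from the slide: user's metric name + "." + metric type."""
    return f"{user_name}.{metric_type}"
```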