TRANSCRIPT
Native Support of Prometheus Monitoring in Apache Spark 3
Dongjoon Hyun, DB Tsai
SPARK+AI SUMMIT 2020
Who am I: Dongjoon Hyun
• Apache Spark PMC and Committer
• Apache ORC PMC and Committer
• Apache REEF PMC and Committer
https://github.com/dongjoon-hyun
https://www.linkedin.com/in/dongjoon
@dongjoonhyun
Who am I: DB Tsai
• Apache Spark PMC and Committer
• Apache SystemML PMC and Committer
• Apache Yunikorn Committer
• Apache Bahir Committer
https://github.com/dbtsai
https://www.linkedin.com/in/dbtsai
@dbtsai
Monitoring Apache Spark: Three popular methods
Web UI (Live and History Server)
• Jobs, Stages, Tasks, SQL queries
• Executors, Storage
Logs
• Event logs and Spark process logs
• Listeners (SparkListener, StreamingQueryListener, SparkStatusTracker, …)
Metrics
• Various numeric values
Metrics are useful to handle gray failures: early warning instead of a post-mortem process
Monitoring and alerting Spark jobs' gray failures
• Memory leak or misconfiguration
• Performance degradation
• Growing streaming job's intermediate state
Prometheus: an open-source systems monitoring and alerting toolkit
Provides
• a multi-dimensional data model
• operational simplicity
• scalable data collection
• a powerful query language
A good option for Apache Spark Metrics
Components
• Prometheus Server
• Prometheus Web UI
• Alert Manager
• Pushgateway
https://en.wikipedia.org/wiki/Prometheus_(software)
Spark 2 with Prometheus (1/3): Using the JmxSink and JMXExporter combination
• Enable Spark's built-in JmxSink in Spark's conf/metrics.properties
• Deploy Prometheus' JMXExporter library and its config file
• Expose JMXExporter port, 9404, to Prometheus
• Add `-javaagent` option to the target (master/worker/executor/driver/…)
-javaagent:./jmx_prometheus_javaagent-0.12.0.jar=9404:config.yaml
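For reference, a minimal `config.yaml` for the JMXExporter javaagent could look like the following. This is an illustrative sketch, not taken from the talk; the single pass-through rule simply exposes every MBean, which is the simplest (if verbose) starting point:

```yaml
# Minimal, illustrative jmx_exporter config.yaml:
# lowercase the metric names and pass every JMX MBean through unchanged.
lowercaseOutputName: true
rules:
  - pattern: ".*"
```

In practice you would add more specific `pattern` rules to rename and filter the Spark MBeans you care about.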
Spark 2 with Prometheus (2/3): Using the GraphiteSink and GraphiteExporter combination
• Set up a Graphite server
• Enable Spark's built-in Graphite sink with several configurations
• Enable Prometheus' GraphiteExporter at Graphite
Spark 2 with Prometheus (3/3): Custom sink (or 3rd-party sink) + Pushgateway server
• Set up a Pushgateway server
• Develop a custom sink (or use 3rd-party libraries) with a Prometheus dependency
• Deploy the sink libraries and their configuration file to the cluster
Pros and Cons
Pros
• Already used in production
• A general approach
Cons
• Difficult to set up in new environments
• Some custom libraries may have a dependency on Spark versions
Goal in Apache Spark 3: Easy usage
Be independent from the existing metrics pipeline
• Use new endpoints and disable them by default
• Avoid introducing new dependencies
Reuse the existing resources
• Use the officially documented ports of Master/Worker/Driver
• Take advantage of Prometheus service discovery in K8s as much as possible
What's new in Spark 3 Metrics
DropWizard Metrics 4 for JDK11 (SPARK-29674 / SPARK-29557)

Timeline: Spark 1.6 through 2.4 (2016–2019) shipped with DropWizard Metrics 3.x (3.1.2 / 3.1.5); Spark 3.0 (2020) upgrades to DropWizard Metrics 4.x (4.1.1).

The exposed metric format changes with the upgrade:
DropWizard Metrics 3.x (Spark 1/2):
metrics_master_workers_Value 0.0
DropWizard Metrics 4.x (Spark 3):
metrics_master_workers_Value{type="gauges",} 0.0
metrics_master_workers_Number{type="gauges",} 0.0
ExecutorMetricsSource: a new metric source
Collects executor memory metrics at the driver and exposes them via ExecutorMetricsSource and the REST API (SPARK-23429, SPARK-27189, SPARK-27324, SPARK-24958)
• JVMHeapMemory / JVMOffHeapMemory
• OnHeapExecutionMemory / OffHeapExecutionMemory
• OnHeapStorageMemory / OffHeapStorageMemory
• OnHeapUnifiedMemory / OffHeapUnifiedMemory
• DirectPoolMemory / MappedPoolMemory
• MinorGCCount / MinorGCTime
• MajorGCCount / MajorGCTime
• ProcessTreeJVMVMemory
• ProcessTreeJVMRSSMemory
• ProcessTreePythonVMemory
• ProcessTreePythonRSSMemory
• ProcessTreeOtherVMemory
• ProcessTreeOtherRSSMemory
JVM Process Tree
Support Prometheus more natively (1/2): Prometheus-format endpoints
PrometheusServlet: a friend of MetricsServlet
• A new metric sink supporting the Prometheus format (SPARK-29032)
• Unified way of configuration via conf/metrics.properties
• No additional system requirements (services / libraries / ports)
PrometheusResource: a single endpoint for all executor memory metrics
• A new metric endpoint exporting all executor metrics at the driver (SPARK-29064 / SPARK-29400)
• The most efficient way to discover and collect metrics, because the driver already has all the information
• Enabled by `spark.ui.prometheus.enabled` (default: false)
Support Prometheus more natively (2/2): spark_info and service discovery
Add the spark_info metric (SPARK-31743)
• A standard Prometheus way to expose version and revision
• Enables monitoring Spark jobs per version
Support driver service annotations in K8s (SPARK-31696)
• Used by Prometheus service discovery
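In the Prometheus text exposition format, this metric looks roughly like the following (label values are illustrative placeholders):

```
spark_info{version="3.0.0", revision="..."} 1.0
```

Because version and revision are labels, a PromQL query can group running jobs by Spark version.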
Under the hood
PrometheusServlet (SPARK-29032: Add PrometheusServlet to monitor Master/Worker/Driver)
• Makes Master/Worker/Driver expose metrics in Prometheus format at the existing ports
• Follows the output style of the "Spark JmxSink + Prometheus JMXExporter + javaagent" approach
Port Prometheus Endpoint (New in 3.0) JSON Endpoint (Since initial release)
Driver 4040 /metrics/prometheus/ /metrics/json/
Worker 8081 /metrics/prometheus/ /metrics/json/
Master 8080 /metrics/master/prometheus/ /metrics/master/json/
Master 8080 /metrics/applications/prometheus/ /metrics/applications/json/
Spark Driver Endpoint Example
PrometheusServlet Configuration: Use conf/metrics.properties like the other sinks
Copy conf/metrics.properties.template to conf/metrics.properties
Uncomment the following in conf/metrics.properties:
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
master.sink.prometheusServlet.path=/metrics/master/prometheus
applications.sink.prometheusServlet.path=/metrics/applications/prometheus
PrometheusResource (SPARK-29064: Add PrometheusResource to export executor metrics)
• A new endpoint with information similar to the JSON endpoint
• The driver exposes all executor memory metrics in Prometheus format
Port Prometheus Endpoint (New in 3.0) JSON Endpoint (Since 1.4)
Driver 4040 /metrics/executors/prometheus/ /api/v1/applications/{id}/executors/
PrometheusResource Configuration: Use spark.ui.prometheus.enabled
Run spark-shell with configuration
Run `curl` with the new endpoint
$ bin/spark-shell \
    -c spark.ui.prometheus.enabled=true \
    -c spark.executor.processTreeMetrics.enabled=true

$ curl http://localhost:4040/metrics/executors/prometheus/ | grep executor | head -n1
metrics_executor_rddBlocks{application_id="...", application_name="...", executor_id="..."} 0
Monitoring in K8s cluster
Key Monitoring Scenarios on K8s clusters
• Monitoring batch job memory behavior => A risk of being killed?
• Monitoring dynamic allocation behavior => Unexpected slowness?
• Monitoring streaming job behavior => Latency?
Monitoring batch job memory behavior (1/2): Use Prometheus Service Discovery
Configuration Value
spark.ui.prometheus.enabled true
spark.kubernetes.driver.annotation.prometheus.io/scrape true
spark.kubernetes.driver.annotation.prometheus.io/path /metrics/executors/prometheus/
spark.kubernetes.driver.annotation.prometheus.io/port 4040
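On the Prometheus side, these pod annotations are picked up by the common annotation-based relabeling pattern. A typical `prometheus.yml` scrape job might look like the following; this is an illustrative sketch of the widely used community configuration, not something prescribed by Spark:

```yaml
scrape_configs:
  - job_name: "spark-driver-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape=true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Use the prometheus.io/path annotation as the metrics path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Use the prometheus.io/port annotation as the scrape port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```

With this in place, any Spark driver pod carrying the annotations from the table above is discovered and scraped automatically.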
Monitoring batch job memory behavior (2/2)

spark-submit --master k8s://$K8S_MASTER --deploy-mode cluster \
  -c spark.driver.memory=2g \
  -c spark.executor.instances=30 \
  -c spark.ui.prometheus.enabled=true \
  -c spark.kubernetes.driver.annotation.prometheus.io/scrape=true \
  -c spark.kubernetes.driver.annotation.prometheus.io/path=/metrics/executors/prometheus/ \
  -c spark.kubernetes.driver.annotation.prometheus.io/port=4040 \
  -c spark.kubernetes.container.image=spark:3.0.0 \
  --class org.apache.spark.examples.SparkPi \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0.jar 200000
OOMKilled at Driver
Monitoring dynamic allocation behavior: Set spark.dynamicAllocation.*

spark-submit --master k8s://$K8S_MASTER --deploy-mode cluster \
  -c spark.dynamicAllocation.enabled=true \
  -c spark.dynamicAllocation.executorIdleTimeout=5 \
  -c spark.dynamicAllocation.shuffleTracking.enabled=true \
  -c spark.dynamicAllocation.maxExecutors=50 \
  -c spark.ui.prometheus.enabled=true \
  … (the same) …
  https://gist.githubusercontent.com/dongjoon-hyun/.../dynamic-pi.py 10000
`dynamic-pi.py` computes Pi, sleeps one minute, and computes Pi again.
Select a single Spark app
rate(metrics_executor_totalTasks_total{...}[1m])
Driver service annotation: Inform Prometheus of both metrics endpoints

spark-submit --master k8s://$K8S_MASTER --deploy-mode cluster \
  -c spark.ui.prometheus.enabled=true \
  -c spark.kubernetes.driver.annotation.prometheus.io/scrape=true \
  -c spark.kubernetes.driver.annotation.prometheus.io/path=/metrics/prometheus/ \
  -c spark.kubernetes.driver.annotation.prometheus.io/port=4040 \
  -c spark.kubernetes.driver.service.annotation.prometheus.io/scrape=true \
  -c spark.kubernetes.driver.service.annotation.prometheus.io/path=/metrics/executors/prometheus/ \
  -c spark.kubernetes.driver.service.annotation.prometheus.io/port=4040 \
  …
[Charts: executor allocation ratio with spark.dynamicAllocation.maxExecutors=30 vs. 300]
Monitoring streaming job behavior (1/2): Set spark.sql.streaming.metricsEnabled=true (default: false)
Metrics
• latency
• inputRate-total
• processingRate-total
• states-rowsTotal
• states-usedBytes
• eventTime-watermark
Prefix of streaming query metric names
• metrics_[namespace]_spark_streaming_[queryName]
Monitoring streaming job behavior (2/2): All metrics are important for alerting
latency > micro-batch interval
• Spark can endure some situations, but the job needs to be redesigned to prevent future outages
states-rowsTotal grows indefinitely
• These jobs will eventually die due to OOM
Related issues:
• SPARK-27340 Alias on TimeWindow expression cause watermark metadata lost (Fixed at 3.0)
• SPARK-30553 Fix structured-streaming java example error
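The conditions above map naturally onto Prometheus alerting rules. The sketch below is illustrative only: the metric name assumes the namespace and query name are both set to `spark` (as suggested in the tips later), and the thresholds are placeholders to adapt to your workload:

```yaml
groups:
  - name: spark-streaming
    rules:
      # Hypothetical metric name; adjust to your namespace/queryName.
      - alert: StreamingStateGrowingIndefinitely
        expr: deriv(metrics_spark_spark_streaming_spark_states_rowsTotal[30m]) > 0
        for: 6h
        annotations:
          summary: "states-rowsTotal has kept growing for 6 hours; OOM risk"
```

A similar rule comparing the latency metric against the micro-batch interval covers the first condition.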
Prometheus Federation and Alert: Separation of concerns

[Diagram: each user namespace (namespace1, namespace2, …) runs its own Prometheus stack (Prometheus Server, Prometheus Web UI, Alert Manager, Pushgateway) collecting its jobs' metrics for batch and streaming monitoring; a cluster-wide Prometheus (Admin) federates a subset of metrics (spark_info + ...) from the per-namespace servers]
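The admin-side pull in this layout can be expressed as an ordinary `/federate` scrape job on the cluster-wide Prometheus. The following is a sketch; the `match[]` selectors and target addresses are illustrative placeholders, not values from the talk:

```yaml
scrape_configs:
  - job_name: "federate-spark"
    honor_labels: true
    metrics_path: /federate
    params:
      "match[]":
        - '{__name__="spark_info"}'           # version/revision per job
        - '{__name__=~"metrics_executor_.*"}'  # executor memory metrics
    static_configs:
      - targets:
          # Placeholder per-namespace Prometheus servers
          - prometheus.namespace1.svc:9090
          - prometheus.namespace2.svc:9090
```

Keeping the federated selector narrow preserves the separation of concerns: users alert on their own jobs, while the admin sees only the cluster-wide subset.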
Limitations and Tips: New endpoints are still experimental
New endpoints expose only Spark metrics starting with `metrics_` or `spark_info`
• The `javaagent` method can expose more metrics like `jvm_info`
PrometheusServlet does not follow the Prometheus naming convention
• Instead, it is designed to follow the Spark 2 naming convention for consistency within Spark
The number of metrics grows if we don't set the following:
writeStream.queryName("spark")
spark.metrics.namespace=spark
Summary
Spark 3 provides a better integration with Prometheus monitoring
• Especially in K8s environments, metric collection becomes much easier than in Spark 2
New Prometheus-style endpoints are independent, additional options
• Users can migrate to the new endpoints or use them alongside the existing methods
Thank you!