Spark Streaming in K8s with ArgoCD & Spark Operator · 2020-11-25
TRANSCRIPT
-
Spark Streaming in K8s with ArgoCD & Spark Operator
Albert Franzi - Data Engineer Lead @ Typeform
-
Agenda
val sc: SparkContext - Where are we nowadays
Spark(implicit mode: K8s) - When Spark met K8s
type Deploy = SparkOperator - How we deploy into K8s
Some[Learnings] - Why it matters
-
About me - Data Engineer Lead @ Typeform
-
About me
Data Engineer Lead @ Typeform
○ Leading the Data Platform team
Previously
○ Data Engineer @ Alpha Health
○ Data Engineer @ Schibsted Classified Media
○ Data Engineer @ Trovit Search
albert-franzi (Medium) · FranziCros (Twitter)
-
About Typeform
-
val sc: SparkContext - Where are we nowadays
-
val sc: SparkContext - Where are we nowadays - Environments
-
val sc: SparkContext - Where are we nowadays - Executions
Great for batch processing
Good orchestrators
Old school · Area 51 · Next slides
-
Spark(implicit mode: K8s) - When Spark met K8s
-
● Delayed EMR releases - EMR 6.1.0 shipped Spark 3.0.0 ~3 months after its release.
● Spark fixed version per cluster.
● Unused resources.
● Same IAM role shared across the entire cluster.
Spark(implicit mode: K8s) - When Spark met K8s - EMR: The Past
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html
-
● Multiple Spark versions running in parallel in the same cluster.
● Use what you need, share what you don’t.
● IAM role per Service Account.
● Different node types based on your needs.
● You define the Docker images.
Spark(implicit mode: K8s) - When Spark met K8s - The Future
-
Spark(implicit mode: K8s) - When Spark met K8s - Requirements
Kubernetes Cluster
v1.13+
AWS SDK
v1.11.788+ 🔗 WebIdentityTokenCredentialsProvider
IAM Roles
Fine-grained IAM roles for service accounts 🔗 IRSA
Spark docker image
hadoop: v3.2.1 · aws_sdk: v1.11.788 · scala: v2.12 · spark: v3.0.0 · java: 8
🔗 hadoop.Dockerfile & spark.Dockerfile
https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/WebIdentityTokenCredentialsProvider.html
https://aws.amazon.com/blogs/opensource/introducing-fine-grained-iam-roles-service-accounts/
https://gist.github.com/afranzi/4685518e24fd81e07639b97c4a5a2757
https://gist.github.com/afranzi/85ff3bf47632fc650cec17b0cc16bbca
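In practice, the IRSA requirement boils down to annotating the Spark service account with an IAM role ARN. A minimal sketch, assuming an EKS cluster with IRSA enabled (the account ID and role name below are placeholders, not from the talk):

```yaml
# Hypothetical service account for Spark driver/executor pods.
# IRSA reads the eks.amazonaws.com/role-arn annotation and injects
# a web identity token that WebIdentityTokenCredentialsProvider picks up.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: spark
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/spark-irsa-role  # placeholder ARN
```

Each SparkApplication can then reference the account via `serviceAccount: "spark"`, giving every job its own fine-grained IAM role instead of one role shared by the whole cluster.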
-
type Deploy = SparkOperator - How we deploy into K8s
-
type Deploy = SparkOperator - How we deploy into K8s
ref: github.com - spark-on-k8s-operator
Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
-
type Deploy = SparkOperator - How we deploy into K8s - Application Specs
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: our-spark-job-name
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: "xxx/typeform/spark:3.0.0"
  imagePullPolicy: Always
  imagePullSecrets: [xxx]
  sparkVersion: "3.0.0"
  restartPolicy:
    type: Never
  volumes:
    - name: temp-volume
      emptyDir: {}
  hadoopConf:
    fs.s3a.aws.credentials.provider: com.amazonaws.auth.WebIdentityTokenCredentialsProvider
  mainClass: com.typeform.data.spark.our.class.package
  mainApplicationFile: "s3a://my_spark_bucket/spark_jars/0.8.23/data-spark-jobs-assembly-0.8.23.jar"
  arguments:
    - --argument_name_1
    - argument_value_1
  driver:
    cores: 1
    coreLimit: "1000m"
    memory: "512m"
    labels:
      version: 3.0.0
    serviceAccount: "spark"
    deleteOnTermination: true
    secrets:
      - name: my-secret
        secretType: generic
        path: /mnt/secrets
    volumeMounts:
      - name: "temp-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 4
    memory: "512m"
    labels:
      version: 3.0.0
    serviceAccount: "spark"
    deleteOnTermination: true
    volumeMounts:
      - name: "temp-volume"
        mountPath: "/tmp"
-
type Deploy = SparkOperator - How we deploy into K8s
schedule: "@every 5m"
concurrencyPolicy: Allow | Forbid | Replace
crontab.guru
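For recurring jobs, the `schedule` and `concurrencyPolicy` fields live on a ScheduledSparkApplication rather than a plain SparkApplication. A hedged sketch based on the Spark Operator v1beta2 API (names and image illustrative):

```yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: ScheduledSparkApplication
metadata:
  name: our-scheduled-spark-job   # hypothetical name
  namespace: spark
spec:
  schedule: "@every 5m"
  concurrencyPolicy: Forbid        # or Allow / Replace
  template:                        # same shape as a SparkApplication spec
    type: Scala
    mode: cluster
    image: "xxx/typeform/spark:3.0.0"
    sparkVersion: "3.0.0"
    restartPolicy:
      type: Never
```

`Forbid` skips a run while the previous one is still going; `Replace` kills the running one and starts fresh; `Allow` lets runs overlap.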
-
type Deploy = SparkOperator - How we deploy into K8s
restartPolicy
Never · Always · OnFailure
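When the policy is OnFailure, the operator also accepts retry knobs. A sketch with field names from the Spark Operator v1beta2 API docs (the values are illustrative, not from the talk):

```yaml
restartPolicy:
  type: OnFailure
  onFailureRetries: 3                   # retries after the app itself fails
  onFailureRetryInterval: 10            # seconds between those retries
  onSubmissionFailureRetries: 5         # retries when submission to K8s fails
  onSubmissionFailureRetryInterval: 20  # seconds between submission retries
```

With `Never`, failed runs stay failed and rely on the scheduler (or a human) to resubmit.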
-
type Deploy = SparkOperator - How we deploy into K8s - Deployment Flow
-
type Deploy = SparkOperator - How we deploy into K8s - Deploying it manually (Simple & easy)
$ sbt assembly
$ aws s3 cp \
    target/scala-2.12/data-spark-jobs-assembly-0.8.23.jar \
    s3://my_spark_bucket/spark_jars/data-spark-jobs_2.12/0.8.23/
$ kubectl apply -f spark-job.yaml
Build the jar, upload it to S3, and deploy the Spark application
$ kubectl delete -f spark-job.yaml
Delete our Spark Application
-
type Deploy = SparkOperator - How we deploy into K8s - Deploying it automatically (Simple & easy)
Argo CD is a declarative, GitOps continuous delivery tool for Kubernetes.
ref: argoproj.github.io/argo-cd
https://argoproj.github.io/argo-cd/
-
type Deploy = SparkOperator - How we deploy into K8s - Deploying it automatically (Simple & easy)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: data-spark-jobs
  namespace: argocd
spec:
  destination:
    namespace: spark
    server: 'https://kubernetes.default.svc'
  project: data-platform-projects
  source:
    helm:
      valueFiles:
        - values.yaml
        - values.prod.yaml
    path: k8s/data-spark-jobs
    repoURL: 'https://github.com/thereponame'
    targetRevision: HEAD
  syncPolicy: {}
Argo CD Application Spec
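The empty `syncPolicy: {}` means syncs stay manual, which matches the "ArgoCD manual Sync" flow on the next slide. Turning on fully automated GitOps deploys is a matter of enabling automated sync; a sketch of that variant, with field names from the Argo CD Application spec:

```yaml
syncPolicy:
  automated:
    prune: true      # delete cluster resources that were removed from Git
    selfHeal: true   # revert manual drift back to what Git declares
```

Manual sync is a reasonable starting point for Spark jobs, since it keeps a human in the loop before a new jar version starts processing production data.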
-
ArgoCD manual Sync
-
type Deploy = SparkOperator - How we deploy into K8s - Deployment Flow
-
Some[Learnings] - Why it matters
-
Some[Learnings]
● It was really easy to set up with the right team and the right infrastructure.
● Different teams and projects adopt new Spark versions at their own pace.
● A Spark testing cluster is always ready to accept new jobs without extra cost, since the K8s cluster is already available in dev environments.
● Monitor pod consumption to tune memory and CPU properly.
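What monitoring pod consumption feeds back into is the resource section of the application spec. A hedged sketch of the knobs involved, with field names from the Spark Operator v1beta2 API (the values are illustrative):

```yaml
executor:
  cores: 1
  coreLimit: "1200m"        # K8s CPU limit; may sit above `cores` for bursts
  instances: 4
  memory: "512m"            # Spark executor heap
  memoryOverhead: "128m"    # off-heap headroom; raise when pods get OOMKilled
```

Undersized limits show up as OOMKilled executors; oversized ones waste the shared cluster capacity the previous bullet points are about.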
Why it matters
-
Some[Learnings] - Why it matters: Data DevOps makes a difference
Add a DevOps engineer to your team and grow them into a Data DevOps.
-
The[team]
Digital Analytics Specialists (x2)
BI / DWH Architect (x2)
Data Devops (x1)
Data engineers (x4)
Data Platform : A multidisciplinary team
-
Links of Interest
● Spark structured streaming in K8s with ArgoCD by Albert Franzi - https://medium.com/albert-franzi/spark-structured-streaming-in-k8s-with-argo-cd-de4942846161
● Spark on K8s operator - https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
● ArgoCD - App of apps pattern - https://argoproj.github.io/argo-cd/operator-manual/cluster-bootstrapping/#app-of-apps-pattern
● Spark History Server in K8s by Carlos Escura - https://medium.com/@carlosescura/run-spark-history-server-on-kubernetes-using-helm-7b03bfed20f6
● Spark Operator - Specs - https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/api-docs.md