Spark Streaming in K8s with ArgoCD & Spark Operator · 2020-11-25
TRANSCRIPT
-
Spark Streaming in K8s with ArgoCD & Spark Operator
Albert Franzi - Data Engineer Lead @ Typeform
-
Agenda
val sc: SparkContext - Where are we nowadays
Spark(implicit mode: K8s) - When Spark met K8s
type Deploy = SparkOperator - How we deploy into K8s
Some[Learnings] - Why it matters
-
About me - Data Engineer Lead @ Typeform
-
About me
Data Engineer Lead @ Typeform
○ Leading the Data Platform team
Previously
○ Data Engineer @ Alpha Health
○ Data Engineer @ Schibsted Classified Media
○ Data Engineer @ Trovit Search
albert-franzi (Medium) · FranziCros (Twitter)
-
About Typeform
-
val sc: SparkContext - Where are we nowadays
-
val sc: SparkContext - Where are we nowadays - Environments
-
val sc: SparkContext - Where are we nowadays - Executions
Great for batch processing
Good orchestrators
Old school · Area 51 · Next slides
-
Spark(implicit mode: K8s) - When Spark met K8s
-
● Delayed EMR releases - EMR 6.1.0 shipped Spark 3.0.0 ~3 months after its release.
● Spark fixed version per cluster.
● Unused resources.
● Same IAM role shared across the entire cluster.
Spark(implicit mode: K8s) - When Spark met K8s - EMR: The Past
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html
-
● Multiple Spark versions running in parallel in the same cluster.
● Use what you need, share what you don’t.
● IAM role per Service Account.
● Different node types based on your needs.
● You define the Docker images.
Spark(implicit mode: K8s) - When Spark met K8s - The Future
-
Spark(implicit mode: K8s) - When Spark met K8s - Requirements
Kubernetes Cluster
v1.13+
AWS SDK
v1.11.788+ 🔗 WebIdentityTokenCredentialsProvider
IAM Roles
Fine-grained IAM roles for service accounts 🔗 IRSA
Spark docker image
hadoop: v3.2.1 · aws_sdk: v1.11.788 · scala: v2.12 · spark: v3.0.0 · java: 8
🔗 hadoop.Dockerfile & spark.Dockerfile
https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/WebIdentityTokenCredentialsProvider.html
https://aws.amazon.com/blogs/opensource/introducing-fine-grained-iam-roles-service-accounts/
https://gist.github.com/afranzi/4685518e24fd81e07639b97c4a5a2757
https://gist.github.com/afranzi/85ff3bf47632fc650cec17b0cc16bbca
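In practice, the IRSA requirement boils down to annotating the Spark service account with an IAM role ARN. A minimal sketch, assuming an EKS cluster with IRSA enabled (the account ID and role name below are placeholders, not from the talk):

```yaml
# Hypothetical service account for Spark driver/executor pods.
# IRSA reads the eks.amazonaws.com/role-arn annotation and injects
# a web identity token that WebIdentityTokenCredentialsProvider picks up.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: spark
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/spark-irsa-role  # placeholder ARN
```

Each SparkApplication can then reference the account via `serviceAccount: "spark"`, giving every job its own fine-grained IAM role instead of one role shared by the whole cluster.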
-
type Deploy = SparkOperator - How we deploy into K8s
-
type Deploy = SparkOperator - How we deploy into K8s
ref: github.com - spark-on-k8s-operator
Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
-
type Deploy = SparkOperator - How we deploy into K8s - Application Specs
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: our-spark-job-name
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: "xxx/typeform/spark:3.0.0"
  imagePullPolicy: Always
  imagePullSecrets: [xxx]
  sparkVersion: "3.0.0"
  restartPolicy:
    type: Never
  volumes:
    - name: temp-volume
      emptyDir: {}
  hadoopConf:
    fs.s3a.aws.credentials.provider: com.amazonaws.auth.WebIdentityTokenCredentialsProvider
  mainClass: com.typeform.data.spark.our.class.package
  mainApplicationFile: "s3a://my_spark_bucket/spark_jars/0.8.23/data-spark-jobs-assembly-0.8.23.jar"
  arguments:
    - --argument_name_1
    - argument_value_1
  driver:
    cores: 1
    coreLimit: "1000m"
    memory: "512m"
    labels:
      version: 3.0.0
    serviceAccount: "spark"
    deleteOnTermination: true
    secrets:
      - name: my-secret
        secretType: generic
        path: /mnt/secrets
    volumeMounts:
      - name: "temp-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 4
    memory: "512m"
    labels:
      version: 3.0.0
    serviceAccount: "spark"
    deleteOnTermination: true
    volumeMounts:
      - name: "temp-volume"
        mountPath: "/tmp"
-
type Deploy = SparkOperator - How we deploy into K8s
schedule: "@every 5m"
concurrencyPolicy: Allow | Forbid | Replace
crontab.guru
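For recurring jobs, the `schedule` and `concurrencyPolicy` fields live on a ScheduledSparkApplication rather than a plain SparkApplication. A hedged sketch based on the Spark Operator v1beta2 API (names and image illustrative):

```yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: ScheduledSparkApplication
metadata:
  name: our-scheduled-spark-job   # hypothetical name
  namespace: spark
spec:
  schedule: "@every 5m"
  concurrencyPolicy: Forbid        # or Allow / Replace
  template:                        # same shape as a SparkApplication spec
    type: Scala
    mode: cluster
    image: "xxx/typeform/spark:3.0.0"
    sparkVersion: "3.0.0"
    restartPolicy:
      type: Never
```

`Forbid` skips a run while the previous one is still going; `Replace` kills the running one and starts fresh; `Allow` lets runs overlap.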
-
type Deploy = SparkOperator - How we deploy into K8s
restartPolicy
Never · Always · OnFailure
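When the policy is OnFailure, the operator also accepts retry knobs. A sketch with field names from the Spark Operator v1beta2 API docs (the values are illustrative, not from the talk):

```yaml
restartPolicy:
  type: OnFailure
  onFailureRetries: 3                   # retries after the app itself fails
  onFailureRetryInterval: 10            # seconds between those retries
  onSubmissionFailureRetries: 5         # retries when submission to K8s fails
  onSubmissionFailureRetryInterval: 20  # seconds between submission retries
```

With `Never`, failed runs stay failed and rely on the scheduler (or a human) to resubmit.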
-
type Deploy = SparkOperator - How we deploy into K8s - Deployment Flow
-
type Deploy = SparkOperator - How we deploy into K8s - Deploying it manually (Simple & easy)
$ sbt assembly
$ aws s3 cp \
    target/scala-2.12/data-spark-jobs-assembly-0.8.23.jar \
    s3://my_spark_bucket/spark_jars/data-spark-jobs_2.12/0.8.23/
$ kubectl apply -f spark-job.yaml
Build the jar, upload it to S3, and deploy the Spark application
$ kubectl delete -f spark-job.yaml
Delete our Spark Application
-
type Deploy = SparkOperator - How we deploy into K8s - Deploying it automatically (Simple & easy)
Argo CD is a declarative, GitOps continuous delivery tool for Kubernetes.
ref: argoproj.github.io/argo-cd
https://argoproj.github.io/argo-cd/
-
type Deploy = SparkOperator - How we deploy into K8s - Deploying it automatically (Simple & easy)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: data-spark-jobs
  namespace: argocd
spec:
  destination:
    namespace: spark
    server: 'https://kubernetes.default.svc'
  project: data-platform-projects
  source:
    helm:
      valueFiles:
        - values.yaml
        - values.prod.yaml
    path: k8s/data-spark-jobs
    repoURL: 'https://github.com/thereponame'
    targetRevision: HEAD
  syncPolicy: {}
Argo CD Application Spec
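The empty `syncPolicy: {}` means syncs stay manual, which matches the "ArgoCD manual Sync" flow on the next slide. Turning on fully automated GitOps deploys is a matter of enabling automated sync; a sketch of that variant, with field names from the Argo CD Application spec:

```yaml
syncPolicy:
  automated:
    prune: true      # delete cluster resources that were removed from Git
    selfHeal: true   # revert manual drift back to what Git declares
```

Manual sync is a reasonable starting point for Spark jobs, since it keeps a human in the loop before a new jar version starts processing production data.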
-
ArgoCD manual Sync
-
type Deploy = SparkOperator - How we deploy into K8s - Deployment Flow
-
Some[Learnings] - Why it matters
-
Some[Learnings]
● It was really easy to set up with the right team and the right infrastructure.
● Different teams and projects adopt new Spark versions at their own pace.
● A Spark testing cluster is always ready to accept new jobs without extra cost, since the K8s cluster is already available in dev environments.
● Monitor pod consumption to tune memory and CPU properly.
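What monitoring pod consumption feeds back into is the resource section of the application spec. A hedged sketch of the knobs involved, with field names from the Spark Operator v1beta2 API (the values are illustrative):

```yaml
executor:
  cores: 1
  coreLimit: "1200m"        # K8s CPU limit; may sit above `cores` for bursts
  instances: 4
  memory: "512m"            # Spark executor heap
  memoryOverhead: "128m"    # off-heap headroom; raise when pods get OOMKilled
```

Undersized limits show up as OOMKilled executors; oversized ones waste the shared cluster capacity the previous bullet points are about.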
Why it matters
-
Some[Learnings] - Why it matters: Data DevOps makes a difference
Add a DevOps engineer to your team and grow them into a Data DevOps.
-
The[team]
Digital Analytics Specialists (x2)
BI / DWH Architect (x2)
Data Devops (x1)
Data engineers (x4)
Data Platform : A multidisciplinary team
-
Links of Interest
● Spark structured streaming in K8s with ArgoCD by Albert Franzi - https://medium.com/albert-franzi/spark-structured-streaming-in-k8s-with-argo-cd-de4942846161
● Spark on K8s operator - https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
● ArgoCD - App of apps pattern - https://argoproj.github.io/argo-cd/operator-manual/cluster-bootstrapping/#app-of-apps-pattern
● Spark History Server in K8s by Carlos Escura - https://medium.com/@carlosescura/run-spark-history-server-on-kubernetes-using-helm-7b03bfed20f6
● Spark Operator - Specs - https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/api-docs.md