Deploying Apache Spark Jobs on Kubernetes with Helm and Spark Operator

Image credit: Hazel Glass (https://portlandopenstudios.com/artists/2019-artists/hazel-glass.html)

For each challenge there are many technology stacks that can provide the solution. I’m not claiming this approach is the holy grail of data processing; this is more the tale of my quest to combine these widely supported tools in a maintainable fashion.

From the outset I’ve tried to generate as much configuration as possible, mainly because I’ve experienced how easy it is to drown in a sea of YAML files, conf files and incompatible versions across registries, repositories, CI/CD pipelines and deployments.

What I created was an sbt script that, when triggered, builds a fat jar, wraps it in a Dockerfile and turns it into an image, while also updating the Helm chart & values. The image is pushed to the registry, and the Helm chart is augmented with environment-specific settings and pushed to the chart museum.

I’ve deployed this both locally on minikube and remotely in Azure, but the Azure flow is less generic to discuss in this article, and the remote deployments rely on Terraform scripts and CI/CD pipelines that are too specific anyway. Do note that in this approach all infra is set up via Homebrew on a Mac, but it should be easy to find equivalents for other environments.

  1. Kubernetes
  2. Helm
  3. Image Registry
  4. Helm Chart Museum
  5. Spark Operator
  6. Spark App
  7. sbt setup
  8. Base Image setup
  9. Helm config
  10. Deploying
  11. Conclusion

1. Kubernetes

I am not a DevOps expert and the purpose of this article is not to discuss all options for Kubernetes, so I will set up a vanilla minikube here, but rest assured that this write-up should be independent of which Kubernetes setup you use. So if you don’t have it already: install minikube and the accompanying tools we will need. VirtualBox is needed to run minikube on, but installing it is sometimes not as simple as the command below suggests, so consult the VirtualBox documentation if you run into trouble.

brew cask install minikube
brew cask install virtualbox
brew install kubernetes-cli

Next we want to start up minikube, but keep in mind that we want to run some Spark jobs, use RBAC and an image registry, so we’ll adjust the startup flags accordingly.

minikube addons enable dashboard
minikube addons enable registry
minikube start \
--cpus 4 \
--memory 8192 \
--extra-config=apiserver.authorization-mode=RBAC \
--insecure-registry=192.168.0.0/16 \
--kubernetes-version=1.16.2

Check if everything is ok by running:

kubectl cluster-info

It should mention something like:

Kubernetes master is running at https://192.168.99.100:8443
KubeDNS is running at
https://192.168.99.100:8443/api/v1/namespaces/kube-system/services/kube-dns/proxy

2. Helm

There is a lot of discussion about whether Helm is useful or just yet another layer of YAML abstraction with its own templating engine. As an engineer (read: non-DevOps) it seems to me the best alternative to writing a whole bunch of Docker & config files. It also allows me to template Spark deployments so that only a small number of variables are needed to distinguish between environments. I’ll explain more when we get there. For now we just set up Helm.

We are using Helm v3 so we don’t need to install tiller in any of our namespaces.

So we just install the cli tools.

brew install helm
helm plugin install https://github.com/chartmuseum/helm-push

That is it for now.

3. Image Registry

In the end we want sbt to create a docker image that can be deployed on kubernetes. This image and also its base image have to be stored in an image registry. Of course many options are available in the cloud, but to keep it simple and generic we’ll use the registry provided with minikube.

So disclaimer: You should not use a local kubernetes registry for production, but I like pragmatism and this article is not about how to run an image registry for kubernetes. We just need a place to push and pull images. If you have access to dockerhub, ACR or any other stable and secure solution, please use that.

We will make sure we are using minikube’s docker for all subsequent commands:

eval $(minikube docker-env)

The registry should now be running:

docker ps | grep registry

This means we can tag our images as [MINIKUBE_IP]:5000/[IMAGE_NAME]:[IMAGE_TAG], push them to this registry and pull them from there with this setup.

You may do a small test by pushing some image to the registry and seeing if it shows up. I used docker images to see which images I had available locally; the registry’s catalog itself can be queried with:

curl -s $(minikube ip):5000/v2/_catalog | jq

Which should at this moment show something like:

{
"repositories": []
}

4. Helm Chart Museum

The next tool we want to have running in our environment is a chart museum, which is nothing more than a repository for Helm charts. Our deployments are going to be quite complicated, with extensive configuration, even with the spark-operator, which abstracts away a lot. Much of this can be bundled in a prefab Helm chart, with only a few environment-dependent and user-provided values left to supply.

Again, this could be run as a Kubernetes deployment with PVC storage and so on, but for ease of use we are just going to run the chart museum as a plain Docker container in this tutorial. This is not what you should do in production!

First we’ll create a location on the machine to permanently store the charts.

mkdir -p ~/.minikube/files/chart-data

Then we will deploy the chart museum. More info about the parameters can be found in the ChartMuseum docs.

docker run -d \
--name chart-museum \
--restart=always \
-p 8080:8080 \
-v ~/.minikube/files/chart-data:/charts \
-e DEBUG=true \
-e STORAGE=local \
-e STORAGE_LOCAL_ROOTDIR=/charts \
chartmuseum/chartmuseum:v0.11.0

Again we do a simple test to see if our chart museum is up and running:

curl $(minikube ip):8080/index.yaml

Which should output something like:

apiVersion: v1
entries: {}
generated: "2019-12-27T11:18:14Z"
serverInfo: {}

5. Spark Operator

Our final piece of infrastructure is the most important part. We are going to install a spark operator on Kubernetes that will trigger on deployed SparkApplications and spawn an Apache Spark cluster as a collection of pods in a specified namespace. The cluster will use Kubernetes as resource negotiator instead of YARN. More info about this can be found in the Spark docs.

To have the spark operator be able to create/destroy pods it needs elevated privileges and should run in a different namespace than the deployed SparkApplications.

First we need to create the two namespaces: one for the operator and one for the apps. Technically one is sufficient, but I like to keep a separation between the spark-operator pods, which have elevated privileges, and the applications themselves.

kubectl create -f files/namespaces-spark.yaml

Where the namespaces-spark.yaml is:

apiVersion: v1
kind: Namespace
metadata:
  name: spark-apps
---
apiVersion: v1
kind: Namespace
metadata:
  name: spark-operator

Next we have to create a service account with some elevated RBAC privileges:

kubectl create serviceaccount spark \
--namespace=spark-operator
kubectl create clusterrolebinding spark-operator-role \
--clusterrole=edit \
--serviceaccount=spark-operator:spark \
--namespace=spark-operator

Now we have the ecosystem set up for the spark operator, which we can install by first adding the incubator repo (because none of this is stable yet) and then running helm install with some configuration. We simply call our Helm release spark.

helm repo add incubator \
http://storage.googleapis.com/kubernetes-charts-incubator

helm install spark \
incubator/sparkoperator \
--version 0.6.1 \
--skip-crds \
--namespace spark-operator \
--set enableWebhook=true,sparkJobNamespace=spark-apps,logLevel=3

It takes a while to get set up, but at some point your spark-operator namespace should look like the output below (run kubectl get all -n spark-operator).

Note that the --skip-crds flag is used here to prevent a known bug, but can probably be removed in later versions.

NAME                                       READY   STATUS      RESTARTS   AGE
pod/spark-sparkoperator-6786d8fcc5-9mtfs   1/1     Running     0          98s
pod/spark-sparkoperator-init-mqmck         0/1     Completed   0          98s

NAME                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/spark-webhook   ClusterIP   10.103.114.93   <none>        443/TCP   98s

NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/spark-sparkoperator   1/1     1            1           98s

NAME                                             DESIRED   CURRENT   READY   AGE
replicaset.apps/spark-sparkoperator-6786d8fcc5   1         1         1       98s

NAME                                 COMPLETIONS   DURATION   AGE
job.batch/spark-sparkoperator-init   1/1           29s        98s

6. Spark App

Now that we have the infrastructure set up to run our Spark applications on, the next task is to create a Spark app that can be deployed on Kubernetes.

In essence this is the least interesting part of this article and should be replaced with your own spark job that needs to be deployed.

But for all intents and purposes I’ve created a small Spark job that reads some files, does some distributed computing and stores the results as parquet in a different folder.

As a ‘big’ sample dataset we can use the MovieLens 20M dataset, with 20 million ratings for 27,000 movies.

We’ll supply an input folder & output folder to the Spark job and calculate the average rating for each movie. Usually you’d want to define config files for this instead of arguments, but again, this is not the purpose of this post.

The code is not spectacular, just the bare minimum to get some distributed data processing that doesn’t finish in one second. Do note that our master definition is optional; this is important for the Kubernetes deployments.

To deploy this job, we want to build a fat jar with all external dependencies wrapped inside, except for Spark, which will be supplied by the image.

All in all, the only thing we have to remember about this job is that it requires two arguments: the input path to the extracted MovieLens dataset and the target output path for the parquet.
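
To give an idea, here is a minimal sketch of what such a job could look like; the actual BasicSparkJob in the repo handles its configuration a bit differently, so treat this as an illustration rather than the exact code.

package xyz.graphiq

import org.apache.spark.sql.functions.avg
import org.apache.spark.sql.{SaveMode, SparkSession}

object BasicSparkJob {
  def main(args: Array[String]): Unit = {
    val Array(inputPath, outputPath) = args.take(2)

    // The master is deliberately optional: locally you can pass -Dspark.master=local[*],
    // while on kubernetes spark-submit (via the spark-operator) supplies it.
    val builder = SparkSession.builder().appName("transform-movie-ratings")
    val spark = sys.props.get("spark.master")
      .fold(builder)(m => builder.master(m))
      .getOrCreate()

    // Read the MovieLens ratings, compute the average rating per movie and write parquet
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(s"$inputPath/ratings.csv")
      .groupBy("movieId")
      .agg(avg("rating").as("averageRating"))
      .write
      .mode(SaveMode.Overwrite)
      .parquet(outputPath)

    spark.stop()
  }
}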

7. sbt setup

Of course we want to add some scaffolding in our build.sbt to build these fancy Docker images that can be used by kubernetes.

Now we want to define the specification of the fat jar. There are many options, but we’ll keep it to the bare essentials for now. We do want to specify the domain; not so much for the assembly itself, but since we have it, we use it.
We also move the jar to the output folder, so we can reference it easily from the Dockerfile later on.
Make sure the sbt-assembly plugin is enabled.
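
A rough sketch of the relevant build.sbt settings, assuming sbt-assembly is on the plugin classpath; the key names are standard sbt-assembly, but the exact values (organization, output path, Scala version) are illustrative.

organization := "xyz.graphiq"   // the 'domain', reused later for the image name
name := "transform-movie-ratings"
version := "0.1"
scalaVersion := "2.12.10"

val sparkVersion = "2.4.4"
libraryDependencies ++= Seq(
  // Spark is marked Provided: the runtime comes from the spark-runner base image
  "org.apache.spark" %% "spark-sql" % sparkVersion % Provided
)

// Don't bundle the Scala library (Spark ships it) and write the fat jar to a
// predictable location so the Dockerfile can pick it up
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
assemblyOutputPath in assembly := baseDirectory.value / "output" / s"${name.value}.jar"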

Next we define the ‘dockerfile’ config. There are a couple of Docker plugins for sbt, but Marcus Lonnberg’s sbt-docker is the most flexible for our purpose.

Notice that the Dockerfile uses a yet undefined base image localhost:5000/spark-runner, which we will create in a later step. There is nothing preventing you from using any other image that is compatible with the spark-operator, but creating your own gives the most control over versions and bundled dependencies.
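
As an illustration, a minimal sbt-docker configuration could look roughly like the snippet below; the jar location inside the image is an assumption and has to match the mainApplicationFile we put in the Helm values later on.

enablePlugins(DockerPlugin)

dockerfile in docker := {
  val fatJar = assembly.value  // build the fat jar first
  new Dockerfile {
    from("localhost:5000/spark-runner:0.1")            // base image created in step 8
    add(fatJar, s"/opt/spark/jars/${fatJar.getName}")  // bundle the application jar
  }
}

imageNames in docker := Seq(
  ImageName(s"graphiq/${name.value}:latest"),
  ImageName(s"graphiq/${name.value}:${version.value}")
)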

Now that we have a docker setup we need to create an accompanying helm chart.

A Helm chart consists of a Chart.yaml containing the meta information about the chart and a values.yaml containing values that can be used in the template. Usually these are created and maintained by hand, but since all this data already exists within the build.sbt we will just have sbt create them when we dockerize our application.

This will create the files helm/Chart.yaml and helm/values.yaml.
We now just have to configure the project to call this function on every build.
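
A sketch of how this generation could be wired up; the task name generateHelm and the exact keys written to values.yaml are assumptions and obviously have to match whatever the Helm template (section 9) expects.

lazy val generateHelm = taskKey[Unit]("Generate helm/Chart.yaml and helm/values.yaml from sbt metadata")

generateHelm := {
  val helmDir = baseDirectory.value / "helm"
  // Chart.yaml: meta information about the chart, taken from the sbt build
  IO.write(
    helmDir / "Chart.yaml",
    s"""apiVersion: v1
       |name: ${name.value}
       |version: ${version.value}
       |appVersion: ${version.value}
       |""".stripMargin
  )
  // values.yaml: defaults the template can use; environment-specific values
  // (values-minikube.yaml etc.) are appended later in the CI/CD step
  IO.write(
    helmDir / "values.yaml",
    s"""imageName: graphiq/${name.value}
       |imageTag: "${version.value}"
       |mainClass: xyz.graphiq.BasicSparkJob
       |jar: local:///opt/spark/jars/${name.value}.jar
       |sparkVersion: "2.4.4"
       |""".stripMargin
  )
}

// regenerate the chart files on every image build
docker := docker.dependsOn(generateHelm).value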

The runLocalSettings are added to be able to run the job locally from sbt, putting the Provided dependencies back on the classpath.

At this point you could essentially run sbt "runMain xyz.graphiq.BasicSparkJob dataset/ml-20m /tmp/movieratings" from the root of the project with the basic sbt settings, to run the Spark App locally.

8. Base Image setup

To test the sbt setup we need to create a base image that has the correct entry point for the spark operator and the correct dependencies to run our Spark application. To be fair, there are many reasonable images we could use here, but I’ve found that at some point it’s useful to have fine-grained control over certain library versions. For example, the Dockerfile we will be using creates a Spark 2.4.4 image (based on gcr.io/spark-operator/spark:v2.4.4) with a Scala 2.12 & Hadoop 3 dependency (not standard) and also a fix for a Spark/Kubernetes bug.

Do note there is some custom tinkering in this config. In my case I needed SPARK_DIST_CLASSPATH and/or SPARK_EXTRA_CLASSPATH to be set correctly before the Spark context started, to get Hadoop to load correctly. The only thing that worked for me was to add it to spark-env.sh, which always gets run before Spark loads. That’s the only Spark config in there, though. Other custom Spark configuration should be loaded via the sparkConf in the Helm chart.

To create this image and make it available in our registry we run:

name="spark-runner"
registry="localhost:5000"
version="0.1"

docker build \
--build-arg VCS_REF=$(git rev-parse --short HEAD) \
--build-arg BUILD_DATE=$(date -u +"%Y-%m-%dT%H:%M:%SZ") \
--build-arg VERSION=0.1 \
-t ${registry}/${name}:${version} . \
&& docker push ${registry}/${name}:${version}

We check if the image is present in our minikube registry by running:
docker images | grep spark-runner

And we should see something like:

localhost:5000/spark-runner                    0.1                   5e3c6186313a        2 minutes ago       927MB

Now we should be able to run our sbt docker command to create our application image

sbt clean docker

Resulting in an image with 2 tags: graphiq/transform-movie-ratings:latest and graphiq/transform-movie-ratings:0.1

And two generated files: helm/Chart.yaml and helm/values.yaml.

9. Helm config

Now that we have our image, with our code as a fat jar and all Spark (and other) dependencies bundled, plus the generated chart and values, we have everything we need to specify a Kubernetes deployment for our app.

To define a custom SparkApplication resource on Kubernetes we need a Helm template based on the configuration expected by the spark operator.

The template below is quite verbose, but that also makes it quite flexible for different kinds of deployments. Many of these options can be used to create tailored deployments for each environment.
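
An abridged sketch of what such a template (e.g. helm/templates/spark-application.yaml) could look like, assuming the v1beta2 SparkApplication CRD; the value names are assumptions that line up with the generated values.yaml and the environment-specific files, and the hostPath volumes for the mounted data folders are left out for brevity.

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: {{ .Release.Name }}
spec:
  type: Scala
  mode: cluster
  image: "{{ .Values.imageName }}:{{ .Values.imageTag }}"
  imagePullPolicy: {{ .Values.imagePullPolicy }}
  mainClass: {{ .Values.mainClass }}
  mainApplicationFile: {{ .Values.jar }}
  arguments:
{{ toYaml .Values.arguments | indent 4 }}
  sparkVersion: {{ .Values.sparkVersion | quote }}
  restartPolicy:
    type: Never
  driver:
    cores: {{ .Values.driver.cores }}
    memory: {{ .Values.driver.memory | quote }}
    serviceAccount: {{ .Values.serviceAccount }}
  executor:
    instances: {{ .Values.executor.instances }}
    cores: {{ .Values.executor.cores }}
    memory: {{ .Values.executor.memory | quote }}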

To specify deployment options for each environment we create a custom values.yaml file for each environment. For this post it will be just minikube, resulting in values-minikube.yaml but you could define multiple configs and have your CI/CD push the correct yaml config to the correct helm deployment.

In our case we don’t want to pull the image (it will already be present in minikube’s Docker), we only need a small cluster (so our machine will not blow up) and we want to pass two directories as arguments: the first one containing the csv files, the second one a path to write the parquet to.
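
Under those assumptions, a rough sketch of values-minikube.yaml could look like this; the /input-data and /output-data paths correspond to the minikube mounts we create in the next section.

imagePullPolicy: Never    # the image is already present in minikube's docker
serviceAccount: spark
driver:
  cores: 1
  memory: "1g"
executor:
  instances: 2
  cores: 1
  memory: "1g"
arguments:
  - /input-data           # mounted folder with the extracted MovieLens csv files
  - /output-data          # mounted folder to write the parquet results to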

10. Deploying

Now that we have all the infrastructure setup and config created we can finally create a deployment for our application.

First we mount the correct folders:

mkdir /tmp/parquet
minikube mount ./dataset/ml-20m/:/input-data &
minikube mount /tmp/parquet:/output-data &

In our CI/CD we would want to create a helm package per environment:

name="transform-movie-ratings"

mkdir output/${name}
cp -r helm/ output/${name}/
cat helm/values-minikube.yaml >> output/${name}/values.yaml
cd output

helm repo add chartmuseum http://$(minikube ip):8080
helm push ${name} chartmuseum
helm repo update

This will create a bundled chart that can be deployed in each environment.

helm upgrade movie-ratings-transform \
chartmuseum/transform-movie-ratings \
--namespace=spark-apps \
--install \
--force

Locally we can also bypass the chartmuseum by running just:

helm upgrade movie-ratings-transform \
./helm \
-f ./helm/values-minikube.yaml \
--namespace=spark-apps \
--install \
--force

This will create a release named movie-ratings-transform, using the generated data in helm and the accompanying template for the Spark application, with added config from the environment-specific file for minikube, resulting in output like:

Release "movie-ratings-transform" has been upgraded. Happy Helming!
NAME: movie-ratings-transform
LAST DEPLOYED: Wed Jan 1 20:20:47 2020
NAMESPACE: spark-apps
STATUS: deployed
REVISION: 8
TEST SUITE: None

Kubernetes will create a driver pod, and the resulting Spark application will request, in our case, 2 executors to run our app.

At some point there will be 3 pods in our cluster and 2 services:

kubectl -n spark-apps get all

Showing:

NAME                                               READY   STATUS    RESTARTS   AGE
pod/movie-ratings-transform-1577906450455-exec-1   1/1     Running   0          65s
pod/movie-ratings-transform-1577906450455-exec-2   1/1     Running   0          65s
pod/movie-ratings-transform-driver                 1/1     Running   0          81s

NAME                                                        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/movie-ratings-transform-1577906450455-driver-svc    ClusterIP   None             <none>        7078/TCP,7079/TCP   80s
service/movie-ratings-transform-ui-svc                      ClusterIP   10.105.183.192   <none>        4040/TCP            80s

To access the UI we can do a port forward to the driver

kubectl port-forward -n spark-apps movie-ratings-transform-driver 4041:4040

The cluster runs until completion and then the executors will get removed, leaving only a completed driver pod to retrieve logs from.

11. Conclusion

In the end this seems like a lot of work to deploy a simple spark application, but there are some distinct advantages to this approach:

  • You are not bound to a specific static cluster to deploy everything on, but get a cluster tuned to the specific needs of the app.
  • You can choose specific dependencies and library versions and maintain those independently.
  • You can deploy on any Kubernetes cluster, cloud agnostic.
  • When using auto-scaling node pools, you have fine-grained control over the resources used (no costly VMs running unnecessarily).
  • All config is centralised in the repo, most of it even in the build.sbt.

I understand this is a lot of information and a lot of steps, which took me quite some time to figure out and fine-tune, but I’m quite pleased with the end result.

In future posts I’ll discuss the CI/CD side more and explain how to trigger these deployments using Airflow.

All code is available on github https://github.com/TomLous/medium-spark-k8s

Please reach out if you have any questions, suggestions or you want me to talk about this.
