Deploying Apache Spark Jobs on Kubernetes with Helm and Spark Operator
For each challenge there are many technology stacks that can provide the solution. I’m not claiming this approach is the holy grail of data processing; this is more the tale of my quest to combine these widely supported tools in a maintainable fashion.
From the outset I’ve tried to generate as much configuration as possible, mainly because I’ve experienced how easy it is to drown in a sea of YAML files, conf files and incompatible versions across registries, repositories, CI/CD pipelines and deployments.
What I created is an sbt script that, when triggered, builds a fat jar, wraps it in a Dockerfile and turns it into an image, whilst also updating the Helm chart and values. The image is pushed to the registry; the Helm chart is augmented with environment-specific settings and pushed to Chart Museum.
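To make that flow concrete, here is a rough sketch of the same steps done by hand in a shell script. It is not the sbt script itself, only an equivalent sequence under assumptions: the app name, version, registry address, Chart Museum URL and chart directory are all hypothetical placeholders, `sbt assembly` assumes the sbt-assembly plugin is configured, and the chart name in Chart.yaml is assumed to match the app name. By default it only prints the commands.

```shell
#!/usr/bin/env bash
set -eu

# All values below are hypothetical placeholders, not taken from the article.
APP_NAME="${APP_NAME:-my-spark-app}"
APP_VERSION="${APP_VERSION:-0.1.0}"
REGISTRY="${REGISTRY:-localhost:5000}"
CHART_REPO="${CHART_REPO:-http://localhost:8080}"
CHART_DIR="${CHART_DIR:-helm/${APP_NAME}}"
IMAGE="${REGISTRY}/${APP_NAME}:${APP_VERSION}"

# Print each command; only execute it when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

release() {
  # Build the fat jar (assumes the sbt-assembly plugin).
  run sbt assembly
  # Wrap the jar in an image using the Dockerfile in the current directory.
  run docker build -t "${IMAGE}" .
  run docker push "${IMAGE}"
  # Package the chart with a version that matches the image tag.
  run helm package "${CHART_DIR}" --version "${APP_VERSION}" --app-version "${APP_VERSION}"
  # Upload the packaged chart through the Chart Museum HTTP API.
  run curl --fail --data-binary "@${APP_NAME}-${APP_VERSION}.tgz" "${CHART_REPO}/api/charts"
}

release
```

Running it with `DRY_RUN=0` would execute the commands for real. Keeping the image tag and chart version derived from one variable is the point of the exercise: it is what prevents the version drift described above.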
I’ve deployed this both locally on minikube and remotely in Azure, but the Azure flow is less generic and so less suited to this article; the remote deployments also rely on Terraform scripts and CI/CD pipelines that are too specific anyway. Do note that in this approach all infra is set up via Homebrew on a Mac, but it should be easy to find equivalents for other environments.
- Image Registry
- Helm Chart Museum
- Spark Operator
- Spark App
- sbt setup
- Base Image setup
- Helm config
I am not a DevOps expert and the purpose of this article is not to discuss all options for Kubernetes, so I will set up a vanilla minikube here, but rest assured that this write-up should be independent of what Kubernetes setup you use. So if you don’t have it already, install minikube and the accompanying tools we will need. VirtualBox is needed to run minikube on, but installing it is sometimes not as simple as described below; read more about setting this up.
brew install minikube
brew install --cask virtualbox
brew install kubernetes-cli
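Since the VirtualBox install in particular can fail quietly, a quick sanity check before going further may save some head-scratching. The sketch below only looks for the three command-line tools on the PATH; it assumes VirtualBox exposes its usual `VBoxManage` CLI and that `kubernetes-cli` provides `kubectl`.

```shell
#!/usr/bin/env bash
set -eu

# Report whether each CLI installed above is on the PATH.
# VBoxManage is the command-line tool that ships with VirtualBox.
check_tools() {
  missing=0
  for tool in minikube kubectl VBoxManage; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: found"
    else
      echo "$tool: missing"
      missing=1
    fi
  done
  return "$missing"
}

check_tools || echo "Install the missing tools before starting minikube."
```

Once all three are found, `minikube start --driver=virtualbox` should bring up a single-node cluster and `kubectl get nodes` should list it. Older minikube releases spell the flag `--vm-driver` instead.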