Deploying Apache Spark Jobs on Kubernetes with Helm and Spark Operator

Tom Lous
12 min read · Jan 1, 2020
Art by: Hazel Glass (https://portlandopenstudios.com/artists/2019-artists/hazel-glass.html)

For each challenge there are many technology stacks that can provide a solution. I'm not claiming this approach is the holy grail of data processing; this is more the tale of my quest to combine these widely supported tools in a maintainable fashion.

From the outset I've tried to generate as much of the configuration as possible, mainly because I've experienced how easy it is to drown in a sea of YAML files, conf files and incompatible versions across registries, repositories, CI/CD pipelines and deployments.

What I created is an sbt script that, when triggered, builds a fat-jar, wraps it in a Dockerfile and turns it into an image, while also updating the Helm chart and its values. The image is pushed to the registry, and the Helm chart is augmented with environment-specific settings and pushed to ChartMuseum.
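To give an idea of what such a build definition might look like, here is a minimal sketch combining sbt-assembly (for the fat-jar) and sbt-docker (for the image). The plugin choice, versions, base image, registry and paths below are illustrative assumptions, not the exact script developed later in this article.

```scala
// build.sbt — minimal sketch; image names, versions and paths are placeholders.
//
// project/plugins.sbt would contain something like:
//   addSbtPlugin("com.eed3si9n"      % "sbt-assembly" % "1.2.0")
//   addSbtPlugin("se.marcuslonnberg" % "sbt-docker"   % "1.9.0")

enablePlugins(sbtdocker.DockerPlugin)

name         := "my-spark-job"
version      := "0.1.0"
scalaVersion := "2.12.15"

// Spark itself is provided by the base image, so it is excluded from the fat-jar
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.1" % Provided

// One self-contained fat-jar with the job and its non-provided dependencies
assembly / assemblyJarName := s"${name.value}-assembly-${version.value}.jar"

// Wrap the fat-jar in an image based on a Spark image that the operator can run
docker / dockerfile := {
  val artifact: File     = assembly.value // triggers the fat-jar build
  val artifactTargetPath = s"/opt/spark/jars/${artifact.name}"
  new Dockerfile {
    from("gcr.io/spark-operator/spark:v3.1.1") // assumed base image
    add(artifact, artifactTargetPath)
  }
}

// Image coordinates that the generated Helm values would reference
docker / imageNames := Seq(
  ImageName(
    registry   = Some("myregistry.example.com"), // placeholder registry
    repository = name.value,
    tag        = Some(version.value)
  )
)
```

With something like this in place, a single `sbt dockerBuildAndPush` builds the fat-jar, bakes it into an image and pushes it to the registry; a further custom task could then rewrite the Helm chart's values with the new image tag and push the packaged chart to ChartMuseum, mirroring the flow described above.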

I’ve deployed this both locally on minikube and remotely in Azure, but the Azure flow is less generic, so I won’t cover it in this article; the remote deployments also rely on Terraform scripts and CI/CD pipelines that are too specific anyway. Do note that in this approach all infrastructure is set up via Homebrew on a Mac, but it should be easy to find equivalents for other environments.

  1. Kubernetes
  2. Helm
