CI/CD for Data Engineers

Tom Lous
17 min read · Feb 26, 2021

Reliably Deploying Scala Spark containers for Kubernetes with GitHub Actions

One of the most under-appreciated parts of software engineering is actually deploying your code. There is a lot of focus on building highly scalable data pipelines, but in the end your code has to be ‘magically’ transferred from a local machine to a deployable piece of pipeline in the cloud.

van Bree — Le Friedland

In a previous article I discussed building data pipelines in Scala & Spark and deploying them on Kubernetes, or at least on a local minikube setup for testing purposes.

Most of the time you don’t want to deploy & run these pipelines directly, but rather make them available as Helm charts & images to be deployed by a scheduler, like Apache Airflow.

So in our case we need a CI/CD setup that versions our code, images and Helm charts, pushes them to the appropriate environments (here: DEV, TEST & PROD) and either supplies the correct version to the scheduler or has every environment automatically pick up the latest build.
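To make the versioning requirement a bit more concrete, here is a minimal sketch in Python of how one version string per build could be derived and shared between the image tag and the Helm chart. Everything in it is an assumption made for illustration: the suffixes (`dev`, `rc`), the `plan_release` helper, the build counter and the repository name `example.registry/my-pipeline` are hypothetical and are not the rules used later in this article.

```python
from dataclasses import dataclass

# Assumed pre-release suffix per environment (hypothetical convention);
# PROD gets the plain version without a suffix.
SUFFIX = {"DEV": "dev", "TEST": "rc", "PROD": None}


@dataclass(frozen=True)
class Release:
    environment: str
    version: str  # shared by the image tag and the Helm chart version
    image: str    # full image reference, repository:tag


def plan_release(base_version: str, environment: str, build: int,
                 repository: str = "example.registry/my-pipeline") -> Release:
    """Derive one version string that tags both the image and the chart."""
    if environment not in SUFFIX:
        raise ValueError(f"unknown environment: {environment}")
    suffix = SUFFIX[environment]
    # Use a SemVer pre-release part ("-dev.42") and no "+build" metadata,
    # because "+" is not a valid character in a Docker image tag.
    version = base_version if suffix is None else f"{base_version}-{suffix}.{build}"
    return Release(environment, version, f"{repository}:{version}")


if __name__ == "__main__":
    for env in ("DEV", "TEST", "PROD"):
        print(plan_release("1.2.0", env, build=42))
```

Using the same string for the image tag and the chart version means a scheduler only needs to be handed a single value per environment to find both artifacts.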

For this we need to think about how to create a workflow around our version control (git), so that the lifecycle of our features is reflected in the correct git status and hooks into the deployment flow.

0. OneFlow
1. Creating new Features
2. Deploying DEV Releases
3. Deploying TEST Releases
4. Deploying PROD Releases
5. Deploying Hotfixes
6. Conclusion
_. Code on GitHub

0. OneFlow

First we, as a team working on the same application, need to adhere to a git branching model that allows working on new features while deploying releases and working on hotfixes, without conflicts.

Popular flows are GitFlow and GitHub Flow. GitFlow is very well known, but overly complicated. GitHub Flow is quite the opposite: fairly easy to use, but lacking some of the rigor & structure that we’ll need to build a reusable deployment pipeline. For our projects we use a flow loosely based on the OneFlow model by Adam Ruka.

Tom Lous

Freelance Data & ML Engineer | husband + father of 2 | #Spark #Scala #ZIO #BigData #ML #Kafka #Airflow #Kubernetes | Shodan Aikido