CI/CD for Data Engineers

Tom Lous
17 min read · Feb 26, 2021

Reliably Deploying Scala Spark containers for Kubernetes with GitHub Actions

One of the most under-appreciated parts of software engineering is actually deploying your code. There is a lot of focus on building highly scalable data pipelines, but in the end your code has to be ‘magically’ transferred from a local machine into a deployable piece of pipeline in the cloud.

[Image: van Bree — Le Friedland]

In a previous article I discussed building data pipelines in Scala & Spark and deploying them on Kubernetes, or at least deploying them on a local minikube setup for testing purposes.

Most of the time you don’t want to deploy and run these pipelines directly, but rather make them available as Helm charts and images that can be deployed by a scheduler, such as Apache Airflow.
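To make that hand-off concrete, here is a minimal sketch of a Helm values file that pins the exact, CI-produced image tag, so the scheduler only has to pick a chart version. The registry, repository, and value names are hypothetical placeholders, not this project’s actual configuration:

```yaml
# values.yaml — hypothetical example; the key names depend on your chart
image:
  repository: registry.example.com/data-pipelines/my-spark-job
  tag: "1.4.2"              # written by CI on each release; avoid "latest"
  pullPolicy: IfNotPresent

sparkJob:
  mainClass: com.example.pipeline.Main   # entry point of the assembly jar
  driverMemory: "2g"
  executorInstances: 3
```

The scheduler then only references a chart version; the chart itself carries the matching image tag, so the two cannot drift apart.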

So in our case we need a CI/CD setup that versions our code, images, and Helm charts, pushes them to the appropriate environments (here: DEV, TEST & PROD), and either supplies the correct version to the scheduler or has every environment automatically pick up the latest build.
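As a rough illustration of such a setup (a sketch, not the exact workflow developed later in this article), a GitHub Actions workflow along these lines builds the assembly jar, pushes a versioned image, and publishes the Helm chart. The registry URL, secret names, and chart repository are assumptions, and the build step presumes an sbt-assembly project:

```yaml
# .github/workflows/build.yml — minimal sketch with hypothetical names
name: build-and-publish
on:
  push:
    branches: [develop, main]   # regular builds
    tags: ['v*']                # release builds

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Build assembly jar
        run: sbt assembly       # assumes the sbt-assembly plugin

      - uses: docker/setup-buildx-action@v1

      - name: Log in to container registry
        uses: docker/login-action@v1
        with:
          registry: registry.example.com
          username: ${{ secrets.REGISTRY_USER }}
          password: ${{ secrets.REGISTRY_TOKEN }}

      - name: Build and push image
        uses: docker/build-push-action@v2
        with:
          context: .
          push: true
          tags: registry.example.com/data-pipelines/my-spark-job:${{ github.sha }}

      - name: Package and publish Helm chart
        run: |
          helm package charts/my-spark-job --version "0.1.0-${GITHUB_SHA::8}"
          # push to a chart repository (e.g. ChartMuseum); URL is hypothetical
          curl --data-binary "@my-spark-job-0.1.0-${GITHUB_SHA::8}.tgz" \
            https://charts.example.com/api/charts
```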

For this we need to think about how to create a workflow with our version control system (git), so that the lifecycle of our features is reflected in the corresponding git state and hooks into the deployment flow.
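One common way to hook that git state into the deployment flow is to derive the target environment from the ref that triggered the build. A sketch, assuming develop maps to DEV, main to TEST, and version tags to PROD (the mapping itself is an assumption; `::set-output` was the output mechanism for GitHub Actions at the time of writing):

```yaml
# Workflow step that maps the triggering git ref to a target environment
- name: Determine target environment
  id: env
  run: |
    case "${GITHUB_REF}" in
      refs/heads/develop) echo "::set-output name=target::DEV"  ;;
      refs/heads/main)    echo "::set-output name=target::TEST" ;;
      refs/tags/v*)       echo "::set-output name=target::PROD" ;;
    esac
```

Later steps can then read `${{ steps.env.outputs.target }}` to decide where the image and chart should be promoted.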
