CI/CD for Data Engineers

Reliably Deploying Scala Spark containers for Kubernetes with Github Actions

One of the most under-appreciated parts of software engineering is actually deploying your code. There is a lot of focus on building highly scalable data pipelines, but in the end your code has to be ‘magically’ transferred from a local machine to a deployable piece of pipeline in the cloud.

van Bree — Le Friedland

In a previous article I’ve discussed building data pipelines in Scala & Spark and deploying them on Kubernetes, well at least deploying them on your local minikube setup for testing purposes.

Most of the time you don’t want to deploy & run these pipelines directly, but make them available as helm charts & images to be deployed by schedulers, like Apache Airflow.

So in our case we need a CI/CD setup that versions our code, images and helm charts, pushes them to the appropriate environments (here: DEV, TEST & PROD), and either supplies the correct version to the scheduler or has every environment automatically pick up the latest build.

For this we need to think about how we can create a workflow using our version control (git), so that the lifecycle of our features gets reflected in the correct git status and hooks into the deployment flow.

0. OneFlow
1. Creating new Features
2. Deploying DEV Releases
3. Deploying TEST Releases
4. Deploying PROD Releases
5. Deploying Hotfixes
6. Conclusion
_. Code on GitHub

0. OneFlow

First we, as a team working on the same application, need to adhere to a git branching model, that will allow working on new features, whilst deploying releases and working on hotfixes, so that there is no conflict.

Popular flows are GitFlow and GitHub Flow. GitFlow is very well known, but overly complicated. GitHub Flow is quite the opposite and fairly easy to use, but it lacks some of the rigor & structure that we’ll need to build a reusable deployment pipeline. For our projects we use a flow loosely based on the OneFlow model by Adam Ruka.

The idea: there is one eternal main branch (or master, default, etc.). The code that resides here is the code that resides in our DEV environment.

1. Creating new Features

Everything we develop as a new feature starts by branching off this main branch into a feature/my-feature branch. The my-feature part can be anything, as long as the branch name starts with feature/. Pushing your feature branch to the central repository should trigger some checks and validation, but have no impact on any deployment.

Makefile

To make creating a new branch as easy as possible, the project contains a Makefile that performs all the required actions. The Makefile is also used for each bash action in the workflow, so that all logic resides in one place.

make create-feature-branch my-new-feature

This creates the correct feature/my-new-feature branch.

It actually does some checks and then runs git checkout -b feature/$(FEATURE), where $(FEATURE) is the first cli param.
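The article does not show which checks the make target performs. As a rough sketch, assuming the checks only validate the name, a shell helper could look like this (feature_branch_name and its naming rule are hypothetical, not taken from the actual Makefile):

```shell
# Hypothetical helper: validate a feature name and print the full branch name.
# Assumption: names consist of lowercase letters, digits, '.', '_' and '-'.
feature_branch_name() {
  case "$1" in
    ''|*[!a-z0-9._-]*)
      # Empty name, or a character outside the allowed set.
      echo "usage: feature_branch_name <feature-name>" >&2
      return 1
      ;;
  esac
  printf 'feature/%s\n' "$1"
}
```

A target like create-feature-branch could then run something along the lines of git checkout -b "$(feature_branch_name my-new-feature)".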

Now you can start coding your new feature. When you are ready, or just want to run some tests, push it to your remote feature branch on GitHub.

Workflow

In the .github/workflows folder you can create a new yaml file (say main-workflow.yaml) that will be triggered when you push to your feature branch, by starting the file with:

name: 'Automatic: On Push'

on:
  push:
    branches:
      - 'feature/**'

Now that this workflow triggers, we want a workflow job that sets up the environment, does some linting & testing and sends an error message to Slack when something has failed.

jobs:

  build:
    name: Build & Test
    runs-on: ubuntu-latest

    steps:
      - name: Check out repository code
        uses: actions/checkout@v2
        with:
          fetch-depth: 0

      - name: Setup Java and Scala
        uses: olafurpg/setup-scala@v10

      - name: Cache sbt
        uses: actions/cache@v2
        with:
          path: |
            ~/.sbt
            ~/.ivy2/cache
          key: ${{ runner.os }}-sbt-cache-v2-${{ hashFiles('**/*.sbt') }}-${{ hashFiles('project/build.properties') }}

      - name: Lint
        shell: bash
        run: make lint

      - name: Test
        shell: bash
        run: make test-coverage

      - name: Codecov
        uses: codecov/codecov-action@v1
        with:
          token: ${{ secrets.CODECOV_TOKEN }}
          directory: target

      - name: Slack on error
        uses: 8398a7/action-slack@v3
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
        with:
          status: ${{ job.status }}
          fields: repo,message,commit,action,workflow,job,took
        if: ${{ failure() }}

First we check out the project, set up Java & Scala and cache everything, so that subsequent runs using the same libraries will not trigger a complete reload. All these steps are done by some of the extensive list of available GitHub Actions in the Marketplace.

Then we actually run the lint & the test targets from the Makefile (which basically call sbt scalastyle & sbt test with some additional parameters).

This is followed by uploading the coverage report to codecov.io & sending a Slack message if something failed in this job. You need to add CODECOV_TOKEN and SLACK_WEBHOOK_URL to the secrets of your repository.

Now you can push this code & workflow to the remote feature branch and see the Github Actions in … action.


2. Deploying DEV Releases

Assuming our feature is complete, we want to deploy it to our DEV environment (in the cloud). In our case deploying to DEV means creating a runnable spark image accompanied by a helm chart for the DEV environment.

In this case we are not actually deploying the image, but making it available for tools like Airflow to pick it up and use it in the scheduled batches.

Pull Request

Horrible history
This is not something you want in your history. Hence squash!

To get the code into the main branch we need a PR.

When positively reviewed by peers, squash merge with the main branch. The squash is not compulsory of course, but you will probably not want all of your local commits & messages in your main branch’s git repo history.

Squashing is the chance to rewrite the entire history of the creation of your feature into one great commit message explaining the functionality & referencing all tickets etc.

As mentioned above, code in the main branch should always reflect the code that is deployed on DEV.

Workflow

So we can extend our workflow in main-workflow.yaml to also include pushes to the main branch.

name: 'Automatic: On Push'

on:
  push:
    branches:
      - 'feature/**'
      - 'main'

Before we start building and bumping versions we actually want to check if something has changed, besides the version. Bumping the version or merging a release (discussed later) could trigger a push on the main branch and start this process all over in a loop. We also want to make sure not to create 2 or more releases when branches are being merged.

Hence some checks in a pre-build job

  check:
    name: Prebuild checks
    runs-on: ubuntu-latest
    outputs:
      num_changes: ${{ steps.check1.outputs.num_changes }}

    steps:
      - name: Check out repository code
        uses: actions/checkout@v2
        with:
          fetch-depth: 0

      - name: Check changes
        id: check1
        shell: bash
        env:
          SHA_OLD: ${{ github.event.before }}
          SHA_NEW: ${{ github.sha }}
        run: |
          echo ::set-output name=num_changes::$(make check-changes)

      - name: Turnstyle (1 at the time)
        uses: softprops/turnstyle@v1
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

A lot of new things happen here. First of all, softprops/turnstyle@v1 makes sure that builds run one at a time.

Secondly, make check-changes expects the commit sha of the previous commit and of the current commit and returns the number of files changed, besides version.sbt. In the Makefile this is done via this beauty:

git diff --name-only $(SHA_OLD) $(SHA_NEW) | (grep -v version.sbt || true) | wc -l

This number is propagated as output of the check job and now can be used in the build step.
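The counting step can also be sketched as a small standalone shell function, which makes it easy to try out without a repository. This is an illustration rather than the project's Makefile target: count_non_version_changes is a hypothetical name, and the pattern assumes that only files literally called version.sbt should be ignored.

```shell
# Hypothetical helper: read changed file names on stdin and print how many
# of them are something other than a version.sbt file.
count_non_version_changes() {
  # -v inverts the match, -c counts the remaining lines.
  # grep exits non-zero when the count is 0, hence the '|| true'.
  grep -c -v -E '(^|/)version\.sbt$' || true
}
```

In the workflow it would be fed by the same diff as above, for example git diff --name-only "$SHA_OLD" "$SHA_NEW" | count_non_version_changes.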

The build step also has to be expanded by bumping the snapshot version and setting & outputting some parameters needed for the deploy step.

  build:
    name: Build & Test
    runs-on: ubuntu-latest
    needs: check
    if: needs.check.outputs.num_changes > 0
    outputs:
      modules: ${{ steps.project.outputs.modules }}
      version: ${{ steps.vars.outputs.version }}

    steps:
      - name: Check out repository code

      ...

      - name: Set Project Modules for matrix
        id: project
        shell: bash
        run: echo ::set-output name=modules::$(make list-modules-json)

      - name: Bump snapshot (main)
        if: github.ref == 'refs/heads/main'
        shell: bash
        run: make bump-snapshot-and-push

      - name: Set variables
        id: vars
        run: echo ::set-output name=version::$(make version)

      - name: Slack on error
        ...

Versioning

The make bump-snapshot-and-push action does a couple of things, but mainly it bumps the current version in version.sbt to the next DEV release version by running sbt bumpSnapshot, based on the release logic in release.sbt. It also commits & pushes the new version.sbt to the current branch.

Versioning follows the Semantic Versioning standard, creating x.y.z versions for official releases and x.y.z-shorthash-SNAPSHOT versions for DEV releases. The release.sbt uses the sbt-release plugin, modified to our needs.

x stands for major versions. The only way to change those is to manually change the val nextReleaseBump = sbtrelease.Version.Bump.Minor to sbtrelease.Version.Bump.Major creating a major new release in the workflow. For example 1.2.0 => 2.0.0

The y stands for Minor release and is the common number that gets bumped during the TEST/PROD release flow. 1.2.0 => 1.3.0

The z stands for patch releases and is only bumped from the hotfix release flow. 1.2.0 => 1.2.1

All DEV releases will assume the next (minor) version bump and add the short commit hash and the string SNAPSHOT to the end. 1.2.0 => 1.3.0-abcd123-SNAPSHOT
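To make the arithmetic concrete, here is a small shell sketch of these bump rules. In the project itself the logic lives in release.sbt and the sbt-release plugin; the function names next_version and snapshot_version are made up for this illustration.

```shell
# Hypothetical helper: print the next version for a bump kind
# (major, minor or patch), given a plain x.y.z version.
next_version() {
  local major minor patch rest
  major="${1%%.*}"    # text before the first dot
  rest="${1#*.}"      # text after the first dot
  minor="${rest%%.*}"
  patch="${rest#*.}"
  case "$2" in
    major) printf '%s.0.0\n' "$((major + 1))" ;;
    minor) printf '%s.%s.0\n' "$major" "$((minor + 1))" ;;
    patch) printf '%s.%s.%s\n' "$major" "$minor" "$((patch + 1))" ;;
    *) echo "unknown bump kind: $2" >&2; return 1 ;;
  esac
}

# DEV snapshot: next minor version, short commit hash, SNAPSHOT suffix.
snapshot_version() {
  printf '%s-%s-SNAPSHOT\n' "$(next_version "$1" minor)" "$2"
}
```

With these definitions, snapshot_version 1.2.0 abcd123 prints 1.3.0-abcd123-SNAPSHOT, matching the example above.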

To make things clear I keep the Scala version the same as the docker version, and the same as the helm chart version. This might lead to unchanged helm charts being pushed as a new version, but in my opinion the clarity it brings makes up for this.

As for make bump-snapshot-and-push: it just updates version.sbt with the new version, then commits & pushes to main, without causing a new GitHub Actions workflow to trigger!

The make version just returns this newly minted version number to be propagated as variable in the rest of the Github Actions jobs.

Matrix

The only thing we need to start creating our deployments is a list of Scala modules we need to build. For each of the sub projects / modules in the sbt we need a docker image and helm chart deployed.

GitHub Actions offers a nice programmable way to transform lists into a matrix of containers to be deployed in GitHub Actions. We’ll just use the make list-modules-json command to have sbt return a list of all the (buildable) modules in json array format. This array is used in the strategy section of the next job.
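The article does not show how list-modules-json builds its output. As a sketch of the format only, assuming sbt can print one module name per line, a shell function could turn that list into the JSON array that fromJson expects (to_json_array and the module names below are hypothetical):

```shell
# Hypothetical helper: turn newline-separated module names on stdin
# into a compact JSON array.
# Assumption: names are plain identifiers that need no JSON escaping.
to_json_array() {
  local out="" line
  while IFS= read -r line; do
    [ -n "$line" ] || continue           # skip empty lines
    out="${out:+$out,}\"$line\""         # add a comma before every item but the first
  done
  printf '[%s]\n' "$out"
}
```

For example, printf 'etl\nscoring\n' | to_json_array prints ["etl","scoring"]. An array in this shape is what the matrix in the job below consumes.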

  deploy:
    if: github.ref == 'refs/heads/main'
    needs: build

    name: Build & Deploy Snapshot
    runs-on: ubuntu-latest
    strategy:
      matrix:
        module: ${{fromJson(needs.build.outputs.modules)}}

    steps:
      - name: Check out repository code
        ...

      - name: Setup Java and Scala
        ...

      - name: Cache sbt
        ...

      - name: Container Registry Login
        shell: bash
        env:
          REGISTRY_PASSWORD: ${{ secrets.GITHUB_TOKEN }}
          REGISTRY_USERNAME: ${{ github.actor }}
        run: make registry-docker-push-login

      - name: Dockerize
        shell: bash
        run: make docker-build ${{ matrix.module }}

      - name: Publish Docker Image to Github Container Registry
        shell: bash
        env:
          REGISTRY_OWNER: ${{ github.repository_owner }}
        run: make docker-push-registry ${{ matrix.module }}

The steps reference matrix.module, which will contain one of the values supplied in the matrix.

These 3 make targets log in to the docker registry, build the image and push it to the GitHub Container Registry (or whatever registry you use, like ACR for Azure).
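How the image is named is not spelled out in the article. As an assumption-laden sketch: if the image name is simply the module name, the tag is the project version, and the registry host is ghcr.io, the full reference could be composed like this (image_ref is a hypothetical helper, not one of the make targets):

```shell
# Hypothetical helper: compose a container image reference for the
# GitHub Container Registry from owner, module and version.
# The owner is lowercased, since repository names in image references
# must be lowercase.
image_ref() {
  local owner
  owner="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  printf 'ghcr.io/%s/%s:%s\n' "$owner" "$2" "$3"
}
```

A push target could then run docker push "$(image_ref "$REGISTRY_OWNER" "$MODULE" "$VERSION")", where MODULE and VERSION are again placeholder variables.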

In the code example I also push helm charts for deployment, but in essence this is the push to DEV part, whatever form that takes in your setup.

Now the code with a new snapshot version is packaged and deployed to DEV.

3. Deploying TEST Releases

Most of our deployment flows don’t have more than 3 environments: DEV & PROD, and in between TEST, STAGING or PRE-PROD. Officially they are separate environments with separate functions, but in my experience there is hardly ever a need for 4 or more environments in data engineering. I’m even willing to argue that 2 should be enough, but most setups will have at least 3. The version released to this environment, let’s call it TEST, will be the exact version that will potentially be released to PROD.

This should be reflected in the versioning number (so no more SNAPSHOT) and in the git flow.

Release branch

The point on the main branch we want to release should spawn a release branch. This can be a manual step selecting any commit of the main branch, but in my experience, working in small teams, this is 99% of the time just the HEAD of the main branch. For example: you have just built a new feature, deployed it to DEV and now want to start releasing it. According to the OneFlow model the branch should be named after the to-be-released version (e.g. release/2.3.0), but since the release branch is short-lived, and there can only be one active at a time (there can’t be multiple versions of your code deployed to TEST), just naming it release makes automation much easier and clearer.

Manual trigger

To make life easy we can trigger this deploy to TEST with a manually triggered github action.

manual trigger from the Github Actions page

How does that work? First we need a new yaml file for this flow. Let’s call it release-workflow.yaml which triggers on a manual workflow_dispatch

name: 'Manual: Start Release'

on:
  workflow_dispatch:

jobs:
  prepare-release:
    name: Prepare release
    ...
    if: github.ref == 'refs/heads/main'

    steps:

      - name: Delete current release branch
        uses: dawidd6/action-delete-branch@v3
        continue-on-error: true
        with:
          github_token: ${{github.token}}
          branches: release

      - name: Create new release branch
        uses: peterjgrainger/action-create-branch@v2.0.1
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          branch: release

      - name: Check out repository code
        uses: actions/checkout@v2
        with:
          ref: release
          fetch-depth: 0

      ...

      - name: Bump Release
        shell: bash
        run: make bump-release-and-push

      - name: Set variables
        id: vars
        run: |
          echo ::set-output name=version::$(make version)
          echo ::set-output name=modules::$(make list-modules-json)

A lot is happening here.
First check that the workflow is actually called from the main branch.

Then any old release branch gets removed if it wasn’t before.
(This could be a TEST deployment that never made it to PROD, but also not properly discarded).

Version bump

Then a new release branch is created from the HEAD of main, code checked out and the version gets bumped to a release version, either upgrading the minor or major version. Again: in my code this is managed in the release.sbt via the val nextReleaseBump = sbtrelease.Version.Bump.Minor config.
But it could be implemented as an input parameter in the manual workflow as well.

Currently if the current DEV version was 0.2.0-abcd123-SNAPSHOT the new release branch will have bumped it to 0.2.0 in code and this commit will get tagged with this version.
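The step from snapshot to release version amounts to dropping the hash and SNAPSHOT suffix. A one-line shell sketch of that transformation, for illustration only (the real bump is done by sbt via release.sbt, and release_version is a made-up name):

```shell
# Hypothetical helper: strip everything from the first '-' onwards,
# turning a snapshot version into its release version.
release_version() {
  printf '%s\n' "${1%%-*}"
}
```

Here release_version 0.2.0-abcd123-SNAPSHOT prints 0.2.0, and a version without a suffix is returned unchanged.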

This new version gets built and deployed to TEST in the next job in this workflow. This happens in the same manner as in the DEV flow, but now to TEST.

Currently Github Actions doesn’t support code includes, so the steps in this job are almost verbatim a copy of the one in the automated DEV flow.

Deploy to PROD action

This version, just built on the release branch, should be the target for deployment to PROD. This could be done automatically after some checks, but I’m not a big fan of automated deployments to PROD.

In many organisations it’s important that a human decides to push this version to PROD after some checking has been done in TEST. This person should maybe not even be the developer of the feature.

There already exists a mechanism in GitHub that allows ordained people to review, authorise and accept/reject. This is of course the pull request.

Pull Request to PROD

Now that we have a release branch, it’s fairly straightforward to create a PR from this branch to the main branch. Merging this PR would, code-wise, only merge a bumped version number in version.sbt and the accompanying git tag, but it could symbolise and trigger a release to PROD.

The PR could contain the release notes and be assigned to stakeholders, and the repo could be set up so that only those people can merge this PR to main.
Also, rejecting the PR should trigger a discard of this version, since it has been deemed unworthy of PROD.

      - name: Find old PR
        uses: juliangruber/find-pull-request-action@v1
        id: fpr
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          branch: release

      - name: Close old PR
        if: ${{ steps.fpr.outputs.number > 0 }}
        uses: peter-evans/close-pull@v1
        with:
          pull-request-number: ${{ steps.fpr.outputs.number }}
          comment: Auto-closing pull request
          delete-branch: false

      - name: Create new PR
        id: pr
        uses: repo-sync/pull-request@v2
        with:
          source_branch: release
          destination_branch: main
          pr_title: "Release ${{ steps.vars.outputs.version }} to PROD"
          pr_body: "..."
          pr_reviewer: "${{ github.actor }}"
          pr_assignee: "${{ github.actor }}"
          pr_label: "auto-pr,release"
          pr_allow_empty: true
          github_token: ${{ secrets.GITHUB_TOKEN }}

      ...

In these steps old PRs from the release branch get discarded (we don’t want to release old versions to PROD) and a new PR is created. This is still part of the manually triggered action.

So now we have released our tagged version to TEST and created a PR to release to PROD.

4. Deploying PROD Releases

When everything is running smoothly on TEST, the correct assignees can now merge the current release branch with the main branch, by completing the PR.

This should trigger a build & deploy to PROD based on the last commit on the release branch.

We can create an automatic trigger in our workflow on a closed PR

on:
  pull_request:
    types: [ closed ]

jobs:
  prepare-release:
    name: Release to Prod
    runs-on: ubuntu-latest

    # If merged & pr was tagged release & from a release branch
    if: contains(github.event.pull_request.labels.*.name, 'release') && github.event.pull_request.merged == true && github.event.pull_request.head.ref == 'release'
    ...

  build-deploy:
    name: Build & Deploy to PROD
    runs-on: ubuntu-latest
    needs: prepare-release
    strategy:
      matrix:
        module: ${{fromJson(needs.prepare-release.outputs.modules)}}

    steps:
      - name: Check out repository code
        uses: actions/checkout@v2
        with:
          ref: ${{ github.event.pull_request.head.sha }}
          fetch-depth: 0
      ...

  success:
    needs: [ prepare-release, build-deploy ]
    name: Notify success
    runs-on: ubuntu-latest

    steps:
      - name: Create release
        id: create_release
        uses: actions/create-release@v1
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          tag_name: v${{ needs.prepare-release.outputs.version }}
          release_name: Release ${{ needs.prepare-release.outputs.version }}
          draft: false
          prerelease: false

      - name: Deploy notification
        uses: 8398a7/action-slack@v3
        ...

      - name: Delete current release branch
        uses: dawidd6/action-delete-branch@v3
        continue-on-error: true
        with:
          github_token: ${{github.token}}
          branches: release

We need to make sure that the closed PR is actually the PROD PR, so we check if it comes from a release branch and it is actually merged. Hence the checks in the prepare-release job.

The build-deploy job has to actually point to the head of the PR, so it doesn’t build merged code.

And finally we want to create a real github ‘release’ based on the tag and clean up the release branch.

Rejected PR / Abandon release

When things don’t run smoothly in TEST, or it’s been decided not to push to PROD, we’d like to cancel the PROD flow. This can be done by just closing the PR. Probably leave a comment why it was abandoned, but this should stop this release.

We’ll add a specific flow in the PR close workflow that handles this route.

name: 'Automatic: Deploy to PROD'

on:
  pull_request:
    types: [ closed ]

jobs:
  prepare-release:
    ...

  build-deploy:
    ...

  success:
    ...

  abandon-release:
    name: Abandon Release to Prod
    runs-on: ubuntu-latest

    # If PR was closed, but not merged
    if: contains(github.event.pull_request.labels.*.name, 'release') && github.event.pull_request.merged == false && github.event.pull_request.head.ref == 'release'

    steps:
      ...
      - name: Delete tag
        shell: bash
        run: |
          TAG=$(git describe --exact-match ${{ github.event.pull_request.head.sha }})
          git tag -d $TAG
          git push --delete origin $TAG
          git push -v origin :refs/tags/$TAG

      - name: Delete current release branch
        uses: dawidd6/action-delete-branch@v3
        continue-on-error: true
        with:
          github_token: ${{github.token}}
          branches: release

First the abandon-release will only trigger if the PR was closed without merging. It will delete the tag (since that was never released in the wild) and also the release branch.
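The two job conditions in this workflow form a small decision table: a closed PR leads to a deploy, an abandon, or nothing at all. The shell sketch below restates that table so it can be checked in isolation. pr_action is a hypothetical function, the labels are passed as a comma-separated string purely for this example, and it covers only the release branch as used so far.

```shell
# Hypothetical restatement of the job conditions for a closed PR.
# Arguments: comma-separated labels, merged flag (true/false), head branch.
pr_action() {
  # Only PRs that carry the exact label 'release' are relevant.
  case ",$1," in
    *,release,*) ;;
    *) echo ignore; return 0 ;;
  esac
  # Only PRs that come from the release branch are relevant.
  if [ "$3" != "release" ]; then
    echo ignore
    return 0
  fi
  # Merged means deploy to PROD, closed without merging means abandon.
  if [ "$2" = "true" ]; then echo deploy; else echo abandon; fi
}
```

For instance, pr_action "auto-pr,release" false release prints abandon, while the same PR merged prints deploy.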

5. Deploying Hotfixes

The only flow yet to be discussed is the hotfix. Something has gone wrong with the PROD release and now we need to fix it ASAP, without disturbing any other feature work.

We start by creating a local hotfix branch based on the latest tag of the current PROD release.

In the code there is a quick make target that does just that

make create-hotfix-branch
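The article does not show how this target finds the tag. One possible approach, sketched here under the assumption that release tags look like v0.2.0 and that a version-aware sort (sort -V, as in GNU coreutils) is available, is to pick the highest release tag from the tag list (latest_release_tag is a hypothetical name):

```shell
# Hypothetical helper: read tag names on stdin and print the highest
# vX.Y.Z tag, ignoring anything that is not a plain release tag.
latest_release_tag() {
  grep -E '^v[0-9]+\.[0-9]+\.[0-9]+$' | sort -V | tail -n 1
}
```

The branch could then be created with something like git checkout -b hotfix "$(git tag --list | latest_release_tag)".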

We can now locally make all required changes and push the hotfix branch to GitHub. We actually want this to be treated as yet another feature, so we’ll add this to the main workflow.

Main workflow

name: 'Automatic: On Push'

on:
  push:
    branches:
      - 'feature/**'
      - 'main'
      - 'hotfix'

Since hotfixes don’t have to be merged to main first, we want a different notification flow that suggests we can start releasing this hotfix to TEST immediately.

  notify:
    if: github.ref == 'refs/heads/hotfix'
    needs: build
    name: Notify hotfix
    runs-on: ubuntu-latest

    steps:
      - name: Hotfix notification
        uses: 8398a7/action-slack@v3
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
        with:
          username: 'github actions'
          author_name: ''
          icon_emoji: ':github:'
          status: ${{ job.status }}
          fields:
          text: ":eight_pointed_black_star: ${{ github.event.repository.name }} *hotfix* ready for release\n\n:arrow_right: <https://github.com/${{ github.repository }}/actions/workflows/release-workflow.yaml|Start Release Workflow ( hotfix ) >"

Release workflow

When we start the release workflow we can use the same manually triggered action we used for the ‘normal’ release flow and make some small adjustments.

First make sure we start not from the main branch, but from the hotfix branch.

Next we’ll need to adjust some logic in this workflow to accommodate the hotfix.

name: 'Manual: Start Release'

on:
  workflow_dispatch:

jobs:
  prepare-release:
    ...

    if: github.ref == 'refs/heads/main' || github.ref == 'refs/heads/hotfix'

    steps:
      - name: Set release type
        id: type
        run: |
          if [ "$REF" == "refs/heads/main" ]
          then
            echo "::set-output name=branch::release"
          else
            echo "::set-output name=branch::hotfix"
          fi
        env:
          REF: ${{ github.ref }}

      - name: Delete current release branch
        uses: dawidd6/action-delete-branch@v3
        if: steps.type.outputs.branch == 'release'
        ...

      - name: Create new release branch
        uses: peterjgrainger/action-create-branch@v2.0.1
        if: steps.type.outputs.branch == 'release'
        ...

      - name: Check out repository code
        uses: actions/checkout@v2
        with:
          ref: ${{ steps.type.outputs.branch }}
          fetch-depth: 0

      - name: Bump Release
        if: steps.type.outputs.branch == 'release'
        shell: bash
        run: make bump-release-and-push

      - name: Bump Hotfix
        if: steps.type.outputs.branch == 'hotfix'
        shell: bash
        run: make bump-patch-and-push

      ...

We need to do a few small changes in some of the steps, otherwise the flow will remain exactly the same.

One thing that is different is the bumping of the version number. To signify a hotfix, we’ll bump the patch version of the current prod release.
In our case we branched from v0.2.0, so the next hotfix version will be v0.2.1

In the end we’ll end up with another PR from our hotfix branch to our main branch and a released hotfix version on our TEST environment.

Prod release

If we (squash) merge this hotfix branch we want it to be treated just like merging the release branch.

name: 'Automatic: Deploy to PROD'

on:
  pull_request:
    types: [ closed ]

jobs:
  prepare-release:
    ...

    # If merged & pr was tagged release & from a release or hotfix branch
    if: contains(github.event.pull_request.labels.*.name, 'release') && github.event.pull_request.merged == true && (github.event.pull_request.head.ref == 'release' || github.event.pull_request.head.ref == 'hotfix')

    steps:
      ...

  build-deploy:
    ...

  success:
    ...

    steps:
      ...
      - name: Delete current release/hotfix branch
        uses: dawidd6/action-delete-branch@v3
        continue-on-error: true
        with:
          github_token: ${{github.token}}
          branches: ${{ github.event.pull_request.head.ref}}

This will release this version to PROD, remove the hotfix branch, create a release and send a notification about everything that has happened.

6. Conclusion

As we have seen setting up this workflow is maybe not the most straightforward approach, but I can tell from experience it works pretty great.

The commit tree is very clean. Versions and tags make sense and at any point you can see what is deployed where. It also makes releasing not as painful.

One thing we also learned quickly is to move all non CI related instructions into make files. This way migrating from github to gitlab or jenkins only impacts workflow files and not the business logic of testing, building, bumping and deploying.

This workflow may not suit all needs for all teams, and perhaps cannot handle some edge cases, but for 99% of the work this has been a great experience, without compromising on the flexibility of releasing.

This article has dived deep into the flow of DEV, TEST & PROD. Feel free to check out the code to inspect the actual building of containers & images.

Code:

Freelance Data & ML Engineer | husband + father of 2 | #Spark #Scala #BigData #ML #DeepLearning #Airflow #Kubernetes | Shodan Aikido
