And is it a good thing?

City view by the river
Photo by Lukas Hartmann from Pexels | Branded content disclosure.

Containers have become the de facto standard for moving data projects to production. No more dependency management nightmares: projects developed on a local machine can be “shipped” to a staging and production cluster (typically) with no surprises. Data pipelines and ML models are finally reproducible and can run anywhere in the same fashion.

However, with an ever-growing number of containerized data workloads, orchestration platforms are becoming increasingly important.

1. Orchestrating Containers

If you want to run reproducible data pipelines and ML models that can run anywhere, you probably know that a Docker image is the way to go. …
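As a minimal sketch of the idea above, a containerized data pipeline image might be defined like this (the file names `requirements.txt` and `pipeline.py` are hypothetical placeholders for your own project files):

```dockerfile
# Minimal image for a Python-based data pipeline (hypothetical file names)
FROM python:3.11-slim

WORKDIR /app

# Install pinned dependencies first so this layer is cached between builds
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the pipeline code and define its entrypoint
COPY pipeline.py .
ENTRYPOINT ["python", "pipeline.py"]
```

Because the dependencies are baked into the image, the same container runs identically on a laptop, a staging cluster, or production.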

Real-time technologies are powerful, but they add significant complexity to your data architecture

Photo by Victor Wang from Pexels

Real-time data pipelines provide a notable advantage over batch processing: data becomes available to consumers faster. In a traditional ETL setup, you would not be able to analyze today’s events until tomorrow’s nightly jobs had finished. These days, many businesses rely on data being available within minutes, seconds, or even milliseconds. With streaming technologies, we no longer need to wait for scheduled batch jobs to see new data events. Live dashboards are updated automatically as new data comes in.

Despite all the benefits, real-time streaming adds a lot of additional complexity to the overall data processes, tooling, and even…
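To make the latency difference concrete, here is a stdlib-only Python sketch (the sample events and the 02:00 batch schedule are invented for illustration) comparing when an event becomes queryable under a nightly batch job versus a streaming pipeline:

```python
from datetime import datetime, timedelta

# Events with arrival timestamps (hypothetical sample data)
events = [
    (datetime(2023, 1, 1, 9, 0), "order_placed"),
    (datetime(2023, 1, 1, 14, 30), "order_shipped"),
    (datetime(2023, 1, 1, 23, 15), "order_delivered"),
]

def batch_availability(event_time: datetime) -> datetime:
    # Classic nightly ETL: an event only becomes queryable after the
    # next scheduled run, here assumed to start at 02:00 the next day.
    next_day = event_time.date() + timedelta(days=1)
    return datetime(next_day.year, next_day.month, next_day.day, 2, 0)

def streaming_availability(event_time: datetime, lag_seconds: int = 5) -> datetime:
    # Streaming pipeline: available after a small processing lag.
    return event_time + timedelta(seconds=lag_seconds)

for ts, name in events:
    delay_batch = batch_availability(ts) - ts
    delay_stream = streaming_availability(ts) - ts
    print(f"{name}: batch delay {delay_batch}, streaming delay {delay_stream}")
```

Under these assumptions, a 09:00 event waits 17 hours in the batch world but only seconds in the streaming one.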

How to choose a decoupling service that suits your use case

Man looking at mountain
Photo by Wil Stewart on Unsplash | Branded content disclosure.

Decoupling offers a myriad of advantages, but choosing the right tool for the job may be challenging. AWS alone provides several services that allow us to decouple sending and receiving data. While these services seem to provide similar functionality on the surface, they are designed for different use cases and each of them can be useful if applied properly to the problem at hand.


As one of the oldest AWS services, SQS has a track record of providing an extremely simple and effective decoupling mechanism. The entire service is based on sending messages to the queue and allowing for applications…
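The send/receive pattern described above can be sketched with Python’s thread-safe stdlib queue as a stand-in for the managed service — this illustrates the decoupling idea itself, not the actual boto3 SQS API:

```python
import queue
import threading

# A stand-in for SQS using Python's thread-safe stdlib queue: the producer
# and consumer only share the queue, never call each other directly.
message_queue: "queue.Queue[str]" = queue.Queue()

def producer() -> None:
    # The sender just enqueues messages; it needs no knowledge of who
    # consumes them, or whether a consumer is even running right now.
    for i in range(3):
        message_queue.put(f"order-{i}")

def consumer(results: list) -> None:
    # The receiver pulls messages at its own pace -- the essence of
    # decoupling: both sides can scale and fail independently.
    for _ in range(3):
        results.append(message_queue.get())

received: list = []
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer, args=(received,))
t1.start(); t2.start()
t1.join(); t2.join()
print(received)  # ['order-0', 'order-1', 'order-2']
```

With SQS, the queue additionally survives process restarts and lives outside both applications, but the contract between sender and receiver is the same.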

Factors you need to consider to provide more realistic estimates

Photo by Andrea Piacquadio from Pexels | Branded content disclosure

Imagine that you’re asked to estimate how long a data project will take. How can you realistically assess its timeline? Considering all the unknowns we need to take into account, the only honest answer seems to be “It depends!” Nearly every data project has too many unknowns, which makes most estimates wrong.

1. It all boils down to data availability and quality

If you need to first get data from five different data sources and establish a regular pipeline for all of them before even starting the actual project that you were assigned to do, then a one-day task can…

“Don’t Repeat Yourself” is beneficial not only in software engineering

Photo by Pixabay from Pexels | Branded content disclosure

Have you ever encountered the same queries being applied over and over again in various dashboards? Or the same KPIs being calculated in nearly every single report? If your answer is yes, you are not alone. It’s common among business users to simply copy-paste the same queries, data definitions, and KPI calculations. But there is a better way.

The dangers of knowledge duplication

Most software engineers are taught from day one the “DRY” principle: Don’t Repeat Yourself. This principle states that:

“Every piece of knowledge must have a single, unambiguous, authoritative representation within a system”. — The Pragmatic Programmer

Even though most data and software…
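Applied to the copy-pasted KPI problem above, DRY might look like the following Python sketch (the function and report names are hypothetical): one authoritative definition that every report calls instead of re-implementing the formula:

```python
# A single authoritative KPI definition, reused by every report instead of
# copy-pasting the calculation into each dashboard (hypothetical example).

def conversion_rate(orders: int, visits: int) -> float:
    """The one place where this KPI's logic lives."""
    return 0.0 if visits == 0 else orders / visits

# Any number of reports now reference the same definition:
def weekly_report(data: dict) -> dict:
    return {"conversion": conversion_rate(data["orders"], data["visits"])}

def executive_dashboard(data: dict) -> dict:
    return {"kpi_conversion": conversion_rate(data["orders"], data["visits"])}

sample = {"orders": 30, "visits": 1200}
print(weekly_report(sample))  # {'conversion': 0.025}
```

If the business later redefines the KPI, the change happens in exactly one place, and every report picks it up automatically. The same pattern applies to SQL, where a shared view or a metrics layer plays the role of the single definition.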

Often the most impactful changes come from rethinking our processes

Photo by Startup Stock Photos from Pexels | Branded content disclosure

Have you ever heard anyone say: “Our data is great, we’ve never had any data quality issues”? Ensuring data quality is hard. The magnitude of the problem makes us believe that we need some really big actions to make any improvements. In reality, though, the simplest and most intuitive solutions are often incredibly impactful. In this article, we’ll look at one idea to improve the process around data quality and make it more rewarding and actionable.

Table of contents

· Taking ownership of data
· Making the process more rewarding & easier to track
· Leveraging automation to facilitate the…

Data projects require a different skill set than software engineering projects

Man typing at computer
Photo by Anete Lusina from Pexels | Branded content disclosure.

We could start this article by sharing statistics that show how quickly the amount of data we generate every day is growing. We could delve into how Big Data and AI are rapidly changing so many areas of our lives. But we no longer have to explain it. Everybody intuitively knows it.

So what’s missing? The right people with the right skill set.

Data Engineering Is Not Equivalent to Software Engineering or Data Science

Nearly every company these days strives to become more data-driven. To accomplish that, many firms are hiring software engineers under the assumption that the challenge is purely on a technical level. …

Modern data engineering requires automated deployment processes

Photo by Sanaan Mazhar from Pexels | Branded content disclosure

Apache Airflow is a commonly used platform for building data engineering workloads. There are so many ways to deploy Airflow that it’s hard to provide one simple answer on how to build a continuous deployment process. In this article, we’ll focus on S3 as “DAG storage” and demonstrate a simple method to implement a robust CI/CD pipeline.

Table of contents

· Creating Apache Airflow environment on AWS
· Git repository
· Building a simple CI/CD for data pipelines
· Testing a CI/CD process for data pipelines in Airflow
· How can we make the CI/CD pipeline more robust for production?
· How does Buddy handle…
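A deployment step that treats S3 as DAG storage can be sketched roughly as follows — a simplified, generic CI configuration with a hypothetical bucket name, not the exact Buddy syntax:

```yaml
# Hypothetical CI pipeline: on every push to main, validate the DAGs
# and sync the dags/ folder to the S3 bucket Airflow reads from.
pipeline: deploy-dags
trigger: push
branch: main
actions:
  - name: validate
    commands:
      - python -m pytest tests/
  - name: deploy
    commands:
      - aws s3 sync dags/ s3://my-airflow-bucket/dags/ --delete
```

The `--delete` flag keeps the bucket in lockstep with the repository, so removing a DAG file from Git also removes it from Airflow.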

Useful tricks that will save you time when using AWS

Artwork in a museum
Photo by mali maeder from Pexels | Branded content disclosure.

If you work a lot with AWS, you have probably realized that literally everything on AWS is an API call. As such, everything can be automated.

This article will discuss several tricks that will save you time when performing everyday tasks in the AWS cloud. Make sure to read until the end because I saved the most interesting one for last.

1. Use the --dryrun Flag in the AWS CLI Before Performing Any Task on Production Resources

If you ever have to perform some S3 migration tasks or want to make changes to other existing AWS resources via the command line, it’s useful to leverage the --dryrun flag to ensure that your…
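For example, the S3 commands in the AWS CLI accept `--dryrun`, which prints the operations that would be performed without executing any of them (the bucket names below are hypothetical):

```shell
# Preview what a bucket-to-bucket sync WOULD do, without copying anything:
aws s3 sync s3://source-bucket/raw/ s3://target-bucket/raw/ --dryrun

# Once the printed plan looks right, run the same command without the flag:
aws s3 sync s3://source-bucket/raw/ s3://target-bucket/raw/
```

This is especially valuable for destructive operations such as `aws s3 rm --recursive`, where a typo in the prefix can otherwise be costly.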

How to make Business Intelligence future-proof by applying software engineering principles

Photo by Philipp Birmes from Pexels | Branded content disclosure

The field of Business Intelligence has evolved significantly over the last few decades. While in the 1980s it was considered an umbrella term for all data-driven decision-making activities, these days it’s most commonly understood as solely the “visualization” and analytics part of the data lifecycle. The term “headless BI” therefore seems to be an oxymoron: how can something that inherently serves visualization be headless? The answer lies in the API layer. This article will demonstrate a decoupled headless BI stack that can be deployed to a Kubernetes cluster or even just to a Docker container on your local machine.


Anna Geller

Data Engineer, M.Sc. in BI, AWS Certified Solution Architect, HIIT, cloud & tech enthusiast living in Berlin.
