Containers have become the de facto standard for moving data projects to production. No more dependency-management nightmares: projects developed on a local machine can be “shipped” to a staging or production cluster (typically) with no surprises. Data pipelines and ML models are finally reproducible and can run anywhere in the same fashion.
However, with an ever-growing number of containerized data workloads, orchestration platforms are becoming increasingly important.
If you want to run reproducible data pipelines and ML models that can run anywhere, you probably know that a Docker image is the way to go. …
Real-time data pipelines provide a notable advantage over batch processing: data becomes available to consumers faster. With traditional ETL, you could not analyze today’s events until tomorrow’s nightly jobs finished. These days, many businesses rely on data being available within minutes, seconds, or even milliseconds. With streaming technologies, we no longer need to wait for scheduled batch jobs to see new data events. Live dashboards update automatically as new data comes in.
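To make the contrast concrete, here is a minimal sketch of a streaming consumer, assuming a Kafka broker and the kafka-python client; the topic name and broker address are placeholders. Each event is visible to the consumer the moment it is produced, with no nightly job in between.

```python
# A minimal streaming-consumer sketch; topic name and broker address
# are placeholders, and kafka-python is assumed (pip install kafka-python).
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "events",                             # hypothetical topic name
    bootstrap_servers="localhost:9092",   # placeholder broker address
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="latest",           # only consume newly arriving events
)

# Each event is processed as soon as it arrives -- no waiting for a batch job.
for message in consumer:
    event = message.value
    print(f"New event available immediately: {event}")
```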
Despite all the benefits, real-time streaming adds considerable complexity to the overall data processes, tooling, and even…
Decoupling offers a myriad of advantages, but choosing the right tool for the job may be challenging. AWS alone provides several services that allow us to decouple sending and receiving data. While these services seem to provide similar functionality on the surface, they are designed for different use cases and each of them can be useful if applied properly to the problem at hand.
As one of the oldest AWS services, SQS has a track record of providing an extremely simple and effective decoupling mechanism. The entire service is based on sending messages to the queue and allowing for applications…
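The mechanics are simple enough to sketch in a few lines of boto3 (the queue name and message body below are made up): a producer sends a message to the queue, and a consumer polls, processes, and deletes it, without either side knowing about the other.

```python
# A minimal SQS decoupling sketch with boto3; the queue name and
# message body are hypothetical.
import boto3

sqs = boto3.client("sqs")
queue_url = sqs.create_queue(QueueName="my-demo-queue")["QueueUrl"]

# Producer side: send a message to the queue.
sqs.send_message(QueueUrl=queue_url, MessageBody='{"order_id": 42}')

# Consumer side: poll the queue, process the message, then delete it.
response = sqs.receive_message(QueueUrl=queue_url, WaitTimeSeconds=10)
for msg in response.get("Messages", []):
    print("Processing:", msg["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```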
Imagine that you’re asked to estimate how long a data project will take. How can you realistically assess the timeline? Given all the unknowns that need to be taken into account, the only honest answer seems to be “It depends!”. Nearly every data project involves so many unknowns that most estimates turn out wrong.
If you first need to get data from five different sources and establish a regular pipeline for each of them before even starting the project you were actually assigned, then a one-day task can…
Have you ever encountered the same queries being applied over and over again in various dashboards? Or the same KPIs being calculated in nearly every single report? If your answer is yes, you are not alone. It’s common among business users to simply copy-paste the same queries, data definitions, and KPI calculations. But there is a better way.
Most software engineers are taught the “DRY” principle, Don’t Repeat Yourself, from day one. The principle states:
“Every piece of knowledge must have a single, unambiguous, authoritative representation within a system”. — The Pragmatic Programmer
Even though most data and software…
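One way to bring DRY to analytics, sketched below under assumed table and column names: define each KPI’s query once in a shared module, and have every dashboard and report import that single definition instead of copy-pasting the SQL.

```python
# kpis.py: one authoritative home for shared KPI definitions.
# The table and column names (orders, order_date, amount) are hypothetical.

MONTHLY_REVENUE_SQL = """
SELECT
    date_trunc('month', order_date) AS month,
    SUM(amount) AS revenue
FROM orders
GROUP BY 1
"""

# Every report reuses the same definition instead of copy-pasting it:
#   from kpis import MONTHLY_REVENUE_SQL
#   revenue = pd.read_sql(MONTHLY_REVENUE_SQL, connection)
```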
Have you ever heard anyone say: “Our data is great, we’ve never had any data quality issues”? Ensuring data quality is hard. The magnitude of the problem makes us believe that we need some really big actions to make any improvement. But in reality, the simplest and most intuitive solutions are often the most impactful. In this article, we’ll look at one idea to improve the process around data quality and make it more rewarding and actionable.
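To give a flavor of how simple such solutions can be, here is a minimal sketch of the kind of lightweight check this mindset leads to; the table and column names are made up.

```python
# A deliberately simple data-quality gate that runs right after a load.
# Table and column names (orders, order_id, amount) are hypothetical.
import pandas as pd

def check_orders(df: pd.DataFrame) -> None:
    """Fail loudly on the most common data-quality problems."""
    assert len(df) > 0, "orders extract is empty"
    assert df["order_id"].is_unique, "duplicate order_id values found"
    assert df["amount"].notna().all(), "NULL amounts found"
```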
We could start this article by sharing statistics that show how quickly the amount of data we generate every day is growing. We could delve into how Big Data and AI are rapidly changing so many areas of our lives. But we no longer have to explain it. Everybody intuitively knows it.
So what’s missing? The right people with the right skill set.
Nearly every company these days strives to become more data-driven. To accomplish that, many firms are hiring software engineers under the assumption that the challenge is purely on a technical level. …
Apache Airflow is a commonly used platform for building data engineering workloads. There are so many ways to deploy Airflow that it’s hard to give a single answer on how to build a continuous deployment process. In this article, we’ll focus on S3 as “DAG storage” and demonstrate a simple method to implement a robust CI/CD pipeline.
Table of contents
· Creating Apache Airflow environment on AWS
· Git repository
· Building a simple CI/CD for data pipelines
· Testing a CI/CD process for data pipelines in Airflow
· How can we make the CI/CD pipeline more robust for production?
· How does Buddy handle…
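Before the walkthrough, here is a minimal sketch of the deployment step at the heart of this setup: a CI/CD job pushing local DAG files to the S3 bucket Airflow reads from. The bucket name and paths are placeholders.

```python
# deploy_dags.py: the essence of the CI/CD step, assuming S3 as DAG storage.
# The bucket name below is a placeholder.
from pathlib import Path
import boto3

s3 = boto3.client("s3")
DAG_BUCKET = "my-airflow-dag-bucket"  # hypothetical bucket

for dag_file in Path("dags").glob("*.py"):
    # Airflow on AWS picks up whatever lands under dags/ in the bucket.
    s3.upload_file(str(dag_file), DAG_BUCKET, f"dags/{dag_file.name}")
    print(f"Deployed {dag_file.name}")
```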
If you work a lot with AWS, you have probably realized that literally everything on AWS is an API call. As such, everything can be automated.
This article will discuss several tricks that will save you time when performing everyday tasks in the AWS cloud. Make sure to read until the end because I saved the most interesting one for last.
If you ever have to perform some S3 migration tasks or want to make changes to other existing AWS resources via the command line, it’s useful to leverage the --dryrun flag to ensure that your…
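For instance, a sync between two buckets can be previewed before a single object is copied (the bucket names are placeholders):

```bash
# Lists what WOULD be copied; nothing is actually transferred.
aws s3 sync s3://source-bucket s3://destination-bucket --dryrun
```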
The field of Business Intelligence has evolved significantly over the past decades. While in the 1980s it was considered an umbrella term for all data-driven decision-making activities, these days it’s most commonly understood as only the “visualization” and analytics part of the data lifecycle. The term “headless BI” therefore seems like an oxymoron: how can something that inherently serves visualization be headless? The answer lies in the API layer. This article will demonstrate a decoupled headless BI stack that can be deployed to a Kubernetes cluster, or even just to a Docker container on your local machine.
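The shape of that API layer, in a hedged sketch: metrics are served as data over HTTP, and any “head” consumes them. The endpoint URL and response payload below are hypothetical.

```python
# A sketch of consuming a headless BI metrics API.
# The endpoint URL and response shape are hypothetical.
import requests

resp = requests.get(
    "http://localhost:4000/api/metrics/monthly_revenue",  # placeholder
    headers={"Authorization": "Bearer <token>"},           # placeholder
)
resp.raise_for_status()

# The same metric definition can now feed any "head":
# a dashboard, a notebook, a spreadsheet, or this script.
for row in resp.json()["data"]:
    print(row["month"], row["revenue"])
```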