AWS defines boto3 as a Python Software Development Kit to create, configure, and manage AWS services. In this article, we’ll look at how boto3 works and how it can help us interact with various AWS services.
Table of contents:
· 1. Boto3 under the hood
· 2. Clients vs. Resources
∘ Why resource is often much easier to use than client
∘ Why you will still use clients for most of your work
· 3. Waiters
∘ Wait until a specific S3 object arrived in S3
· 4. Collections
· 5. Sessions: How to pass IAM credentials to your boto3 code?
∘ How to…
Data lakes used to have a bad reputation when it comes to data quality. In contrast to data warehouses, data doesn’t need to adhere to any predefined schema before we can load it in. Without proper testing and governance, your data lake can easily turn into a data swamp.
In this article, we’ll look at how to build automated data tests that will be executed any time new data is loaded to a data lake. We’ll also configure SNS-based alerting to get notified about data that deviates from our expectations.
Containers have become the de facto standard for moving data projects to production. No more dependency management nightmares— projects developed on a local machine can be “shipped” to a staging and production cluster (typically) with no surprises. Data pipelines and ML models are finally reproducible and can run anywhere in the same fashion.
However, with an ever-growing number of containerized data workloads, orchestration platforms are becoming increasingly important.
If you want to run reproducible data pipelines and ML models that can run anywhere, you probably know that a Docker image is the way to go. …
Real-time data pipelines provide a notable advantage over batch processing — data becomes available to consumers faster. In the traditional ETL, you would not be able to analyze events from today until tomorrow’s nightly jobs would finish. These days, many businesses rely on data being available within minutes, seconds, or even milliseconds. With streaming technologies, we no longer need to wait for scheduled batch jobs to see new data events. Live dashboards are updated automatically as new data comes in.
Despite all the benefits, real-time streaming adds a lot of additional complexity to the overall data processes, tooling, and even…
Decoupling offers a myriad of advantages, but choosing the right tool for the job may be challenging. AWS alone provides several services that allow us to decouple sending and receiving data. While these services seem to provide similar functionality on the surface, they are designed for different use cases and each of them can be useful if applied properly to the problem at hand.
As one of the oldest AWS services, SQS has a track record of providing an extremely simple and effective decoupling mechanism. The entire service is based on sending messages to the queue and allowing for applications…
Imagine that you’re asked to give an estimate for how long a data project will take. How can you realistically assess the timeline for it? When thinking about all the unknowns that we need to take into account, the only correct answer seems to be “It depends!”. There are too many unknowns in nearly every data project which makes most estimates wrong.
If you need to first get data from five different data sources and establish a regular pipeline for all of them before even starting the actual project that you were assigned to do, then a one-day task can…
Have you ever encountered the same queries being applied over and over again in various dashboards? Or the same KPIs being calculated in nearly every single report? If your answer is yes, you are not alone. It’s common among business users to simply copy-paste the same queries, data definitions, and KPI calculations. But there is a better way.
Most software engineers are taught from day one the “DRY” principle: Don’t Repeat Yourself. This principle states that:
“Every piece of knowledge must have a single, unambiguous, authoritative representation within a system”. — The Pragmatic Programmer
Even though most data and software…
Have you ever heard anyone saying: “Our data is great, we’ve never had any data quality issues"? Ensuring data quality is hard. The magnitude of the problem makes us believe that we need some really big actions to make any improvements. But the reality shows, often the simplest and most intuitive solutions can be incredibly impactful. In this article, we’ll look at one idea to improve the process around data quality, and make it more rewarding and actionable.
Table of contents
We could start this article by sharing statistics that show how quickly the amount of data we generate every day is growing. We could delve into how Big Data and AI are rapidly changing so many areas of our lives. But we no longer have to explain it. Everybody intuitively knows it.
So what’s missing? The right people with the right skill set.
Nearly every company these days strives to become more data-driven. To accomplish that, many firms are hiring software engineers under the assumption that the challenge is purely on a technical level. …
Apache Airflow is a commonly used platform for building data engineering workloads. There are so many ways to deploy Airflow that it’s hard to provide one simple answer on how to build a continuous deployment process. In this article, we’ll focus on S3 as “DAG storage” and demonstrate a simple method to implement a robust CI/CD pipeline.
Table of contents
· Creating Apache Airflow environment on AWS
· Git repository
· Building a simple CI/CD for data pipelines
· Testing a CI/CD process for data pipelines in Airflow
· How can we make the CI/CD pipeline more robust for production?
· How does Buddy handle…