A deep dive into boto3 and how AWS built it

Photo by Kindel Media from Pexels

AWS defines boto3 as a Python Software Development Kit to create, configure, and manage AWS services. In this article, we’ll look at how boto3 works and how it can help us interact with various AWS services.
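Before the deep dive, here is a minimal sketch of the two interfaces boto3 exposes, using S3 as an example; the bucket and prefix names are made up:

```python
import boto3

# Low-level client: a thin wrapper around the AWS API that returns plain dictionaries.
s3_client = boto3.client("s3")
response = s3_client.list_objects_v2(Bucket="my-example-bucket", Prefix="raw/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Higher-level resource: an object-oriented abstraction on top of the same API calls.
s3_resource = boto3.resource("s3")
for obj in s3_resource.Bucket("my-example-bucket").objects.filter(Prefix="raw/"):
    print(obj.key, obj.size)
```

Both snippets list the same objects; the resource collection handles pagination for you, while the client call mirrors the raw API response.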

Table of contents:

· 1. Boto3 under the hood
· 2. Clients vs. Resources
Why resource is often much easier to use than client
Why you will still use clients for most of your work
· 3. Waiters
Wait until a specific object arrives in S3 (sketched below)
· 4. Collections
· 5. Sessions: How to pass IAM credentials to your boto3 code?
How to…
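As a preview of the waiters section above, the snippet below blocks until an object shows up in S3; the bucket and key are made up:

```python
import boto3

s3_client = boto3.client("s3")

# The waiter polls HeadObject until the key exists (or the attempts run out).
waiter = s3_client.get_waiter("object_exists")
waiter.wait(
    Bucket="my-example-bucket",
    Key="raw/orders/2021-08-01.csv",
    WaiterConfig={"Delay": 5, "MaxAttempts": 60},  # check every 5 seconds, up to 5 minutes
)
print("Object arrived - downstream processing can start.")
```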


Ensure data quality in your S3 data lake using Python, AWS Lambda, SNS, and Great Expectations

Photo by Anna Nekrashevich from Pexels | Branded content disclosure.

Data lakes used to have a bad reputation when it comes to data quality. In contrast to data warehouses, data doesn’t need to adhere to any predefined schema before we can load it in. Without proper testing and governance, your data lake can easily turn into a data swamp.

In this article, we’ll look at how to build automated data tests that will be executed any time new data is loaded to a data lake. We’ll also configure SNS-based alerting to get notified about data that deviates from our expectations.
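To make this concrete, here is a rough sketch of such a Lambda handler, triggered by an S3 upload event. The column names, expectations, and topic ARN are made up, and the Great Expectations calls use the legacy Pandas dataset API, which differs between versions:

```python
import json
import boto3
import pandas as pd
import great_expectations as ge

SNS_TOPIC_ARN = "arn:aws:sns:eu-central-1:123456789012:data-quality-alerts"  # made up

s3 = boto3.client("s3")
sns = boto3.client("sns")


def handler(event, context):
    # Triggered by an S3 "ObjectCreated" event for the new file in the data lake.
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]

    obj = s3.get_object(Bucket=bucket, Key=key)
    df = pd.read_csv(obj["Body"])

    # Wrap the dataframe so expectation methods become available
    # (legacy Pandas dataset API; newer GE versions use a different entry point).
    gdf = ge.from_pandas(df)
    gdf.expect_column_values_to_not_be_null("order_id")
    gdf.expect_column_values_to_be_between("amount", min_value=0, max_value=100000)

    # The exact shape of the validation result varies across GE versions.
    results = gdf.validate()
    if not results.success:
        sns.publish(
            TopicArn=SNS_TOPIC_ARN,
            Subject=f"Data quality check failed for {key}"[:100],
            Message=str(results),
        )
    return {"success": results.success}
```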

Table of Contents

· Python libraries for data quality
· Using Great Expectations
· Using…


And is it a good thing?

Photo by Lukas Hartmann from Pexels | Branded content disclosure.

Containers have become the de facto standard for moving data projects to production. No more dependency management nightmares: projects developed on a local machine can be “shipped” to a staging and production cluster (typically) with no surprises. Data pipelines and ML models are finally reproducible and can run anywhere in the same fashion.

However, with an ever-growing number of containerized data workloads, orchestration platforms are becoming increasingly important.

1. Orchestrating Containers

If you want reproducible data pipelines and ML models that can run anywhere, you probably know that a Docker image is the way to go. …
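As a small local illustration of that workflow, the sketch below builds and runs such an image with the Docker SDK for Python; the image name and entrypoint script are made up, and in practice this step usually lives in a CI pipeline or orchestrator rather than in ad-hoc code:

```python
import docker

client = docker.from_env()

# Build the pipeline image from the project's Dockerfile (tag is made up).
image, _ = client.images.build(path=".", tag="my-data-pipeline:latest")

# Run one pipeline step in a container; the same image runs unchanged on any cluster.
logs = client.containers.run(
    "my-data-pipeline:latest",
    command="python run_pipeline.py --date 2021-08-01",
    remove=True,
)
print(logs.decode())
```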


Real-time technologies are powerful but they add significant complexity to your data architecture

Photo by Victor Wang from Pexels

Real-time data pipelines provide a notable advantage over batch processing: data becomes available to consumers faster. With traditional ETL, you could not analyze today’s events until the nightly batch jobs had finished. These days, many businesses rely on data being available within minutes, seconds, or even milliseconds. With streaming technologies, we no longer need to wait for scheduled batch jobs to see new data events. Live dashboards are updated automatically as new data comes in.
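As a small illustration of that difference, the sketch below tails an Amazon Kinesis stream with boto3; the stream name is made up, and a production consumer would typically use the KCL or a managed streaming framework rather than a single-shard polling loop:

```python
import time
import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "clickstream-events"  # made up

# Start reading at the tip of the first shard (a real consumer iterates over all shards).
shard_id = kinesis.describe_stream(StreamName=STREAM_NAME)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM_NAME, ShardId=shard_id, ShardIteratorType="LATEST"
)["ShardIterator"]

while True:
    batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in batch["Records"]:
        print(record["Data"])  # each event is visible seconds after it was produced
    iterator = batch["NextShardIterator"]
    time.sleep(1)
```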

Despite all the benefits, real-time streaming adds a lot of additional complexity to the overall data processes, tooling, and even…


How to choose a decoupling service that suits your use case

Photo by Wil Stewart on Unsplash | Branded content disclosure.

Decoupling offers a myriad of advantages, but choosing the right tool for the job may be challenging. AWS alone provides several services that allow us to decouple sending and receiving data. While these services seem to provide similar functionality on the surface, they are designed for different use cases and each of them can be useful if applied properly to the problem at hand.

AWS SQS

As one of the oldest AWS services, SQS has a track record of providing an extremely simple and effective decoupling mechanism. The entire service is based on sending messages to the queue and allowing for applications…
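In boto3 terms, the basic send/receive/delete cycle looks roughly like the sketch below; the queue URL and message payload are made up:

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-central-1.amazonaws.com/123456789012/orders-queue"  # made up

# Producer side: the sender only needs the queue URL, not any knowledge of the consumer.
sqs.send_message(QueueUrl=QUEUE_URL, MessageBody='{"order_id": 123, "amount": 42.5}')

# Consumer side: long-poll for up to 10 messages, process them, then delete each one.
response = sqs.receive_message(
    QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
)
for message in response.get("Messages", []):
    print(message["Body"])
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```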


Factors you need to consider to provide more realistic estimates

Photo by Andrea Piacquadio from Pexels | Branded content disclosure

Imagine that you’re asked to give an estimate for how long a data project will take. How can you realistically assess the timeline for it? When thinking about all the unknowns that we need to take into account, the only correct answer seems to be “It depends!”. There are too many unknowns in nearly every data project, which makes most estimates wrong.

1. It all boils down to data availability and quality

If you need to first get data from five different data sources and establish a regular pipeline for all of them before even starting the actual project that you were assigned to do, then a one-day task can…


“Don’t Repeat Yourself” is beneficial, and not only in software engineering

Photo by Pixabay from Pexels | Branded content disclosure

Have you ever encountered the same queries being applied over and over again in various dashboards? Or the same KPIs being calculated in nearly every single report? If your answer is yes, you are not alone. It’s common among business users to simply copy-paste the same queries, data definitions, and KPI calculations. But there is a better way.
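One hedged illustration of that better way: keep a single, shared definition of each KPI and import it everywhere, instead of copy-pasting the logic into every report (the module, function, and column names here are made up):

```python
# metrics.py - the one authoritative definition of the KPI
import pandas as pd


def revenue_per_customer(orders: pd.DataFrame) -> pd.DataFrame:
    """Total revenue per customer, reused by every report and dashboard."""
    return (
        orders.groupby("customer_id", as_index=False)["amount"]
        .sum()
        .rename(columns={"amount": "revenue"})
    )


# any_report.py - consumers import the definition instead of re-implementing it:
# from metrics import revenue_per_customer
# kpi = revenue_per_customer(orders_df)
```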

The dangers of knowledge duplication

Most software engineers are taught the “DRY” principle, Don’t Repeat Yourself, from day one. The principle states:

“Every piece of knowledge must have a single, unambiguous, authoritative representation within a system”. — The Pragmatic Programmer

Even though most data and software…


Often the most impactful changes come from rethinking our processes

Photo by Startup Stock Photos from Pexels | Branded content disclosure

Have you ever heard anyone say: “Our data is great, we’ve never had any data quality issues”? Ensuring data quality is hard. The magnitude of the problem makes us believe that we need some really big actions to make any improvements. In reality, though, the simplest and most intuitive solutions are often incredibly impactful. In this article, we’ll look at one idea to improve the process around data quality and make it more rewarding and actionable.

Table of contents

· Taking ownership of data
· Making the process more rewarding & easier to track
· Leveraging automation to facilitate the…


Data projects require a different skill set than software engineering projects

Photo by Anete Lusina from Pexels | Branded content disclosure.

We could start this article by sharing statistics that show how quickly the amount of data we generate every day is growing. We could delve into how Big Data and AI are rapidly changing so many areas of our lives. But we no longer have to explain it. Everybody intuitively knows it.

So what’s missing? The right people with the right skill set.

Data Engineering Is Not Equivalent to Software Engineering or Data Science

Nearly every company these days strives to become more data-driven. To accomplish that, many firms are hiring software engineers under the assumption that the challenge is purely on a technical level. …


Modern data engineering requires automated deployment processes

Photo by Sanaan Mazhar from Pexels | Branded content disclosure

Apache Airflow is a commonly used platform for building data engineering workloads. There are so many ways to deploy Airflow that it’s hard to provide one simple answer on how to build a continuous deployment process. In this article, we’ll focus on S3 as “DAG storage” and demonstrate a simple method to implement a robust CI/CD pipeline.
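With S3 as DAG storage, the deploy step of that CI/CD pipeline can boil down to uploading the DAG files to the bucket, roughly as in the sketch below; the bucket name is made up, and a shell-based `aws s3 sync` in the CI job achieves the same thing:

```python
import os
import boto3

# Minimal deployment step: upload every DAG file to the S3 "DAG storage" prefix.
# Bucket name and prefix are made up; MWAA, for example, reads DAGs from s3://<bucket>/dags/.
DAG_BUCKET = "my-airflow-environment-bucket"
DAG_PREFIX = "dags/"

s3 = boto3.client("s3")

for root, _, files in os.walk("dags"):
    for filename in files:
        if filename.endswith(".py"):
            local_path = os.path.join(root, filename)
            key = DAG_PREFIX + os.path.relpath(local_path, "dags")
            s3.upload_file(local_path, DAG_BUCKET, key)
            print(f"Deployed {local_path} -> s3://{DAG_BUCKET}/{key}")
```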

Table of contents

· Creating Apache Airflow environment on AWS
· Git repository
· Building a simple CI/CD for data pipelines
· Testing a CI/CD process for data pipelines in Airflow
· How can we make the CI/CD pipeline more robust for production?
· How does Buddy handle…

Anna Geller

Data Engineer, M.Sc. in BI, AWS Certified Solutions Architect, HIIT, cloud & tech enthusiast living in Berlin. www.annageller.com
