In this article, we’ll discuss potential pitfalls that we came across when configuring ECS task definitions. While considering this AWS-specific container management platform, we’ll also examine some general best practices for working with containers in production.
Table of contents:
· #1 Wrong logging configuration
· #2 Failing to enable “Auto-assign public IP”
· #3 Storing credentials in plain text in the ECS task definition
∘ #1. Using AWS Systems Manager Parameter Store
∘ #2. Using AWS Secrets Manager
∘ What’s the difference between AWS Systems Manager Parameter Store and AWS Secrets Manager?
· #4 Using the same IAM task…
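As a preview of pitfall #3, here is a hedged sketch of how a task definition can reference a secret instead of embedding it in plain text. The account ID, ARN, image, and log group below are placeholders, not values from the article:

```json
{
  "containerDefinitions": [
    {
      "name": "app",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/app:latest",
      "secrets": [
        {
          "name": "DB_PASSWORD",
          "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-password"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/app",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "app"
        }
      }
    }
  ]
}
```

With this setup, ECS injects the secret value as the `DB_PASSWORD` environment variable at container start, so the credential never appears in the task definition itself.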
Engineering time is a scarce resource. We often have to balance many tasks and conflicting priorities. However, there are some activities for which allocating more of that time can be beneficial. In this article, we’ll look at ten of them.
Have you ever deleted something prematurely, only to discover that there was no backup? A good rule of thumb is to check three times before deleting anything. This may involve cross-checking whether we are in the right environment, region, database schema, or S3 bucket.
Additionally, there are many ways of mitigating the impact of unintentional deletions:
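One such mitigation, sketched here purely as an assumption (the article may list different ones), is enabling S3 bucket versioning so that deletes create recoverable delete markers rather than destroying data. The bucket name is a placeholder, and the client is injectable so the sketch stays testable:

```python
def enable_bucket_versioning(bucket: str, s3_client=None) -> dict:
    """Enable versioning so deleted or overwritten objects can be recovered."""
    if s3_client is None:
        import boto3  # imported lazily so the function is easy to stub out in tests

        s3_client = boto3.client("s3")
    # With versioning enabled, a DELETE only adds a delete marker on top of
    # the previous versions, which remain restorable.
    return s3_client.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )
```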
During re:Invent 2017, Amazon’s VP and CTO, Werner Vogels, made a bold statement: he claimed that all the code we will ever write in the future is business logic.
Back then, many of us were skeptical, but looking at the current developments, especially in the data engineering and analytics space, this quote might hold true.
Unless you are a technology company, chances are that maintaining internally developed tools not directly tied to a concrete business objective (expressed by business logic) may no longer be necessary.
In fact, it may even be detrimental in the…
Amazon Simple Storage Service (S3) is by far the most popular service on AWS. The simplicity and scalability of S3 made it a go-to platform not only for storing objects, but also for hosting static websites, serving ML models, providing backup functionality, and much more. It became the simplest solution for event-driven processing of images, video, and audio files, and even matured into a de facto replacement for Hadoop in big data processing. In this article, we’ll look at various ways to leverage the power of S3 in Python.
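To give a flavor of what working with S3 in Python looks like, here is a minimal sketch of uploading an in-memory object and listing what is in the bucket. Bucket and key names are placeholders, and the client is injectable so the sketch stays testable without AWS credentials:

```python
import io


def upload_and_list(bucket: str, key: str, data: bytes, s3_client=None):
    """Upload raw bytes to S3 under `key`, then return all keys in the bucket."""
    if s3_client is None:
        import boto3  # imported lazily so the function is easy to stub out in tests

        s3_client = boto3.client("s3")
    # upload_fileobj streams from any file-like object, so large payloads
    # don't need to be written to disk first.
    s3_client.upload_fileobj(io.BytesIO(data), bucket, key)
    response = s3_client.list_objects_v2(Bucket=bucket)
    return [obj["Key"] for obj in response.get("Contents", [])]
```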
Table of contents:
Designing a data model for analytics is not the same as doing it for transactional processing. You optimize for access patterns that are very different from row-level data retrieval used in OLTP systems. In this article, we’ll look at the most common pitfalls when designing schemas and tables for analytics.
Building data assets is an ongoing process. As your analytical needs change over time, the schema will have to be adjusted as well. Treating data modeling as a one-off activity is unrealistic. …
AWS describes boto3 as a Python Software Development Kit (SDK) to create, configure, and manage AWS services. In this article, we’ll look at how boto3 works and how it can help us interact with various AWS services.
Table of contents:
· 1. Boto3 under the hood
· 2. Clients vs. Resources
∘ Why resource is often much easier to use than client
∘ Why you will still use clients for most of your work
· 3. Waiters
∘ Wait until a specific object arrives in S3
· 4. Collections
· 5. Sessions: How to pass IAM credentials to your boto3 code?
∘ How to…
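One item from the list above, waiters, can be sketched in a few lines: boto3’s built-in `object_exists` waiter polls until an S3 key appears. Bucket and key names are placeholders, and the client is injectable so the sketch stays testable:

```python
def wait_for_object(bucket: str, key: str, s3_client=None) -> None:
    """Block until the given S3 object exists (polls HeadObject under the hood)."""
    if s3_client is None:
        import boto3  # imported lazily so the function is easy to stub out in tests

        s3_client = boto3.client("s3")
    waiter = s3_client.get_waiter("object_exists")
    # Poll every 5 seconds, giving up after 24 attempts (~2 minutes).
    waiter.wait(
        Bucket=bucket,
        Key=key,
        WaiterConfig={"Delay": 5, "MaxAttempts": 24},
    )
```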
Data lakes used to have a bad reputation when it comes to data quality. In contrast to data warehouses, data doesn’t need to adhere to any predefined schema before we can load it in. Without proper testing and governance, your data lake can easily turn into a data swamp.
In this article, we’ll look at how to build automated data tests that will be executed any time new data is loaded to a data lake. We’ll also configure SNS-based alerting to get notified about data that deviates from our expectations.
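The shape of such a check can be sketched as follows: a tiny data-quality test that flags rows missing a required column and publishes an SNS alert when expectations are violated. The column name and topic ARN are illustrative assumptions, not the article’s actual setup, and the client is injectable so the sketch stays testable:

```python
def check_no_null_ids(rows, topic_arn: str, sns_client=None) -> bool:
    """Return True if every row has an 'id'; otherwise alert via SNS and return False."""
    bad = [row for row in rows if row.get("id") is None]
    if not bad:
        return True
    if sns_client is None:
        import boto3  # imported lazily so the function is easy to stub out in tests

        sns_client = boto3.client("sns")
    # Notify subscribers (e.g. an on-call email or Slack webhook) about the
    # data that deviated from our expectations.
    sns_client.publish(
        TopicArn=topic_arn,
        Subject="Data quality alert",
        Message=f"{len(bad)} row(s) arrived without an 'id' value",
    )
    return False
```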
Containers have become the de facto standard for moving data projects to production. No more dependency management nightmares: projects developed on a local machine can be “shipped” to a staging and production cluster (typically) with no surprises. Data pipelines and ML models are finally reproducible and can run anywhere in the same fashion.
However, with an ever-growing number of containerized data workloads, orchestration platforms are becoming increasingly important.
If you want to run reproducible data pipelines and ML models that can run anywhere, you probably know that a Docker image is the way to go. …
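A minimal sketch of such an image, assuming a Python pipeline with a `requirements.txt` and an entrypoint script named `pipeline.py` (both names are placeholders):

```dockerfile
# Slim base image keeps the container small and reproducible.
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so this layer is cached between code changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the pipeline code itself.
COPY . .

ENTRYPOINT ["python", "pipeline.py"]
```

Because everything the pipeline needs is baked into the image, the same artifact runs identically on a laptop, a staging cluster, or production.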
Real-time data pipelines provide a notable advantage over batch processing: data becomes available to consumers faster. In a traditional ETL process, you could not analyze today’s events until tomorrow’s nightly jobs had finished. These days, many businesses rely on data being available within minutes, seconds, or even milliseconds. With streaming technologies, we no longer need to wait for scheduled batch jobs to see new data events. Live dashboards are updated automatically as new data comes in.
Despite all the benefits, real-time streaming adds a lot of additional complexity to the overall data processes, tooling, and even…
Decoupling offers a myriad of advantages, but choosing the right tool for the job may be challenging. AWS alone provides several services that allow us to decouple sending and receiving data. While these services seem to provide similar functionality on the surface, they are designed for different use cases and each of them can be useful if applied properly to the problem at hand.
As one of the oldest AWS services, SQS has a track record of providing an extremely simple and effective decoupling mechanism. The entire service is based on sending messages to the queue and allowing for applications…
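That send-and-receive mechanism can be sketched in a few lines of boto3. The queue URL is a placeholder, and the client is injectable so the sketch stays testable without AWS credentials:

```python
def send_and_receive(queue_url: str, body: str, sqs_client=None):
    """Send one message to an SQS queue, then receive and acknowledge it."""
    if sqs_client is None:
        import boto3  # imported lazily so the function is easy to stub out in tests

        sqs_client = boto3.client("sqs")
    sqs_client.send_message(QueueUrl=queue_url, MessageBody=body)
    # Long polling (WaitTimeSeconds) avoids busy-looping on an empty queue.
    response = sqs_client.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=5
    )
    messages = response.get("Messages", [])
    for message in messages:
        # Deleting acknowledges successful processing; otherwise the message
        # reappears after the visibility timeout expires.
        sqs_client.delete_message(
            QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
        )
    return [m["Body"] for m in messages]
```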