Serverless Real-Time Data Pipelines on AWS with Prefect, ECS and GitHub Actions

A guide to fully automated serverless real-time data pipelines

Anna Geller
15 min read · Jul 25, 2022


Most data platforms these days are still operated using batch processing. Even though streaming technology has matured, building automated and reliable real-time data pipelines is still difficult and often requires a team of engineers to operate the underlying platform. But it doesn’t have to be that way. We already wrote about this last year.

In this post, we’ll get more hands-on. You’ll see how to turn any batch processing Python script into a real-time data pipeline orchestrated by Prefect. We’ll deploy the real-time streaming flow to a serverless containerized service running on AWS ECS Fargate. All resources will be deployed with Infrastructure as Code (using CloudFormation), and the deployment process can be triggered with a single click from a GitHub Actions workflow.

With a CI/CD template, we’ll then ensure that future changes will be automatically redeployed with no manual intervention and no downtime.
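The core idea of turning a batch script into a continuously running service can be sketched in plain Python. This is a minimal illustration, not the code used later in this post: `process_batch`, `stream`, and their parameters are hypothetical names, and the sketch deliberately leaves out Prefect so it runs with the standard library alone. In Prefect 2.0, the processing function would typically be decorated with `@flow` (imported from `prefect`) so that each iteration is tracked as a flow run.

```python
import time
from typing import Callable, Iterator, List, Optional


def process_batch(records: List[dict]) -> int:
    """Hypothetical batch step: count the records that carry a value."""
    valid = [r for r in records if r.get("value") is not None]
    return len(valid)


def stream(
    fetch: Callable[[], List[dict]],
    handle: Callable[[List[dict]], int],
    interval_seconds: float = 1.0,
    max_iterations: Optional[int] = None,
) -> Iterator[int]:
    """Run the batch logic repeatedly, yielding one result per iteration.

    With max_iterations=None the loop never ends, which is the
    "streaming service" mode. A finite value is useful for local testing.
    """
    iteration = 0
    while max_iterations is None or iteration < max_iterations:
        # Each pass is one small batch: fetch new data, then process it.
        yield handle(fetch())
        iteration += 1
        # Pause between passes; a short interval approximates real time.
        time.sleep(interval_seconds)
```

For a quick local check, pass a stub data source and a small `max_iterations`, for example `list(stream(lambda: [{"value": 1}], process_batch, interval_seconds=0, max_iterations=3))`. Running the same loop without a limit inside a container is the pattern the rest of this post builds on.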

Table of contents:

1. Why Prefect 2.0 for real-time data pipelines?
  Drawbacks of batch processing
  Opinion: why you likely don’t need a distributed message queue
  How can Prefect 2.0 handle such low-latency real-time workflows?
  Benefits of moving towards real-time workflows with Prefect 2.0
  Why can I not just run a single DAG 24/7?
2. Demo time!
  Prerequisites
  Typical batch-processing flow
  Turn it into a streaming service
  What if something goes wrong?
3. Getting value from real-time: take automated action!
  Using Prefect Blocks to store key-value pairs
  Validate the data
  Conclusion on a local demo
4. Deploy the real-time data pipeline as a serverless container
  Configure repository secrets
  Deploy the entire infrastructure in a single click
  Observe the real-time data pipelines in your Prefect UI
5. Automate future deployments with CI/CD
  Making changes to the code
  (Optional) One flow run gets stuck in a Running state
  What happens when there are infrastructure issues?
  Limitations of the approach presented in
