Serverless Real-Time Data Pipelines on AWS with Prefect, ECS and GitHub Actions
A guide to fully automated serverless real-time data pipelines
Most data platforms today still run on batch processing. Even though streaming technology has matured, building automated and reliable real-time data pipelines is still difficult and often requires a team of engineers to operate the underlying platform. But it doesn’t have to be that way. We already wrote about this last year.
In this post, we’ll get more hands-on. You’ll see how to turn any batch-processing Python script into a real-time data pipeline orchestrated by Prefect. We’ll deploy the real-time streaming flow to a serverless containerized service running on AWS ECS Fargate. All resources will be deployed with Infrastructure as Code (leveraging CloudFormation), and the deployment process can be triggered with a single click from a GitHub Actions workflow.
With a CI/CD template, we’ll then ensure that future changes are redeployed automatically, with no manual intervention and no downtime.
Table of contents:
1. Why Prefect 2.0 for real-time data pipelines?
∘ Drawbacks of batch processing
∘ Opinion: why you likely don’t need a distributed message queue
∘ How can Prefect 2.0 handle such low-latency real-time workflows?
∘ Benefits of moving towards real-time workflows with Prefect 2.0
∘ Why can I not just run a single DAG 24/7?
2. Demo time!
∘ Prerequisites
∘ Typical batch-processing flow
∘ Turn it into a streaming service
∘ What if something goes wrong?
3. Getting value from real-time: take automated action!
∘ Using Prefect Blocks to store key-value pairs
∘ Validate the data
∘ Conclusion on a local demo
4. Deploy the real-time data pipeline as a serverless container
∘ Configure repository secrets
∘ Deploy the entire infrastructure in a single click
∘ Observe the real-time data pipelines in your Prefect UI
5. Automate future deployments with CI/CD
∘ Making changes to the code
∘ (Optional) One flow run gets stuck in a Running state
∘ What happens when there are infrastructure issues?
∘ Limitations of the approach presented in…