Serverless Real-Time Data Pipelines on AWS with Prefect, ECS and GitHub Actions
A guide to fully automated serverless real-time data pipelines
Most data platforms are still operated using batch processing. Even though streaming technology has matured, building automated and reliable real-time data pipelines is still difficult and often requires a team of engineers to operate the underlying platform. But it doesn’t have to be that way. We already wrote about this last year.
In this post, we’ll get more hands-on. You’ll see how to turn any batch-processing Python script into a real-time data pipeline orchestrated by Prefect. We’ll deploy the real-time streaming flow as a serverless containerized service on AWS ECS Fargate. All resources will be deployed with Infrastructure as Code (using CloudFormation), and the deployment process can be triggered with a single click from a GitHub Actions workflow.
With a CI/CD template, we’ll then ensure that future changes will be automatically redeployed with no manual intervention and no downtime.
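To make the core idea concrete before going further, here is a minimal sketch of what "turning a batch script into a real-time pipeline" can look like: the existing batch logic is called repeatedly by a long-running process instead of once per schedule. The names `extract_and_load` and `run_continuously`, the sample records, and the polling interval are all hypothetical, chosen only for illustration. The sketch uses the standard library only; in a Prefect 2.0 setup, the job function would presumably be decorated with Prefect's `@flow` so that each iteration is tracked as a flow run.

```python
import time
from typing import Callable, Optional


def extract_and_load() -> int:
    """Stand-in for the body of an existing batch script (hypothetical)."""
    # A real script would read new records from a source and write them to a sink.
    records = [{"id": 1}, {"id": 2}]
    return len(records)


def run_continuously(
    job: Callable[[], int],
    interval_seconds: float = 1.0,
    max_runs: Optional[int] = None,
) -> int:
    """Call `job` repeatedly and return the total number of processed records.

    max_runs=None loops forever, which is what a long-running service would do.
    """
    total = 0
    runs = 0
    while max_runs is None or runs < max_runs:
        total += job()
        runs += 1
        # Pause between iterations so the source is polled, not hammered.
        time.sleep(interval_seconds)
    return total


if __name__ == "__main__":
    # Bounded here so the example terminates; a deployed service would omit max_runs.
    print(run_continuously(extract_and_load, interval_seconds=0.1, max_runs=3))
```

The `max_runs` parameter exists only to keep the example testable. The containerized service described in this post would run the loop indefinitely and rely on the container platform to restart it if it exits.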
Table of contents:
1. Why Prefect 2.0 for real-time data pipelines?
∘ Drawbacks of batch processing
∘ Opinion: why you likely don’t need a distributed message queue
∘ How can Prefect 2.0 handle…