Serverless Real-Time Data Pipelines on AWS with Prefect, ECS and GitHub Actions

A guide to fully automated serverless real-time data pipelines

Anna Geller
15 min read · Jul 25, 2022

Most data platforms these days are still operated using batch processing. Even though streaming technology has matured, building automated and reliable real-time data pipelines is still difficult and often requires a team of engineers to operate the underlying platform. But it doesn’t have to be that way. We already wrote about this last year.

In this post, we’ll get more hands-on. You’ll see how to turn any batch processing Python script into a real-time data pipeline orchestrated by Prefect. We’ll deploy the real-time streaming flow to a serverless containerized service running on AWS ECS Fargate. All resources will be deployed with Infrastructure as Code (using CloudFormation), and the deployment process can be triggered with a single click from a GitHub Actions workflow.
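
To make the batch-to-real-time idea concrete before diving in, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the exact code used later in this post: `process_batch` and `run_continuously` are hypothetical names, the records are placeholder data, and the `max_iterations` parameter exists only so the loop can be stopped in a demo. With Prefect 2.0, the batch step would typically be a function decorated with `@flow` (imported from `prefect`); the decorator is left out here so the sketch runs without any dependencies.

```python
import time
from typing import Callable, Optional


def process_batch() -> int:
    """Hypothetical batch step: pick up whatever is new and process it."""
    # Placeholder data standing in for records read from a real source.
    records = [{"id": 1, "price": 10.0}, {"id": 2, "price": 12.5}]
    # A real script would transform and load the records here.
    return len(records)


def run_continuously(
    step: Callable[[], int],
    pause_seconds: float = 1.0,
    max_iterations: Optional[int] = None,
) -> int:
    """Call `step` repeatedly and return the total number of records processed.

    With max_iterations=None the loop never ends, which is the
    "real-time" mode; a finite value is only useful for demos and tests.
    """
    total = 0
    iteration = 0
    while max_iterations is None or iteration < max_iterations:
        total += step()
        iteration += 1
        time.sleep(pause_seconds)
    return total
```

Calling `run_continuously(process_batch, pause_seconds=0, max_iterations=3)` runs the batch step three times and then returns; calling `run_continuously(process_batch)` keeps going until the process is stopped, which is the behavior a long-running container would rely on.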

With a CI/CD template, we’ll then ensure that future changes will be automatically redeployed with no manual intervention and no downtime.

Table of contents:
1. Why Prefect 2.0 for real-time data pipelines?
Drawbacks of batch processing
Opinion: why you likely don’t need a distributed message queue
How can Prefect 2.0 handle
