2023 State of Data Infrastructure — Key Trends from Matt Turck’s MAD Landscape

A summary of the 2023 data infrastructure landscape from the perspective of a practitioner who talks (a lot) to end users in data communities

Data Infrastructure category of the 2023 MAD Landscape
2023 ML, AI, and Data (MAD) Landscape

Main changes in the 2023 data infrastructure

💀 Hadoop

🛥️ Data Lakes got merged into the same category as lakehouses

  • Cloudera (2008, $1041M) — an enterprise data hub built on top of Apache Hadoop.
  • Databricks (2013, $3497M) — their lakehouse platform is used for data integration and analytics services; Databricks introduced the lakehouse paradigm and are the category leader.
  • Dremio (2015, $405M) — a data analytics platform that allows business users to query data from any data source, then accelerate analytical processing for BI tools, ML, and SQL clients.
  • Onehouse (2021, $33M) — a cloud-native managed lakehouse service that helps to build data lakes, process data, and own data in open-source formats.
  • Azure Data Lake Storage — S3-like object storage service on Azure, commonly referred to as ADLS Gen 2
  • Azure HDInsight — same as above but for the Hadoop ecosystem
  • GCP’s Google BigLake — allows you to create BigLake tables on Google Cloud Storage (GCS), Amazon S3, and ADLS Gen 2 over supported open file formats, such as Parquet, ORC, and Avro.
  • GCP’s Google Cloud Dataproc — same as above but for the Hadoop ecosystem
  • AWS Lake Formation — makes it easier to manage an S3-based data lake with integration for Glue metadata catalog, Athena query engine, etc.
  • AWS’s Amazon EMR — same as above but for the Hadoop ecosystem (see the pattern between Cloud vendors?)
Data Lakes/Lakehouses

New single category: Data Quality & Observability

Data Quality & Observability

New database categories

Fully managed data platforms

The growing importance of consulting services

Data & AI Consulting

Trends in the 2023 data infrastructure

Bundling and consolidation — from a buyer’s perspective

Bundling and consolidation — from a seller’s perspective

Bundling and consolidation — predictions for consolidation

  1. ETL and reverse ETL — similar to how Airbyte acquired Grouparoo, Fivetran could acquire one of its reverse ETL partners (Census or Hightouch)
  2. Data Quality & Observability — these two categories are converging. Data observability is still so new that interested buyers often struggle to get organizational buy-in for such a purchase, especially during a recession. When push comes to shove, these tools are considered “nice-to-have”: very few users I’ve interacted with are actively using, evaluating, or even asking for such products or features, despite their obvious importance and usefulness.
  3. Data Catalogs — there are too many players, and it’s unclear which ones will survive and whether they can stand on their own without being tied to a larger platform (governance/observability or cloud data warehouse).

The Modern Data Stack (MDS) is under pressure

The line between Reverse ETL and CDP becomes blurry

  1. Reverse ETL tools started to become CDP-like, providing direct customer data analytics without having to rely on other tools for that.
  2. CDPs started to become more reverse-ETL-like, integrating more closely with data warehouses.

Data mesh, products, contracts

How not to get overwhelmed by this big ecosystem

Next steps



Lead DX Engineer, Data Professional, Cloud & .py fan. www.annageller.com. Get my articles via email: https://annageller.medium.com/subscribe
