2023 State of Data Infrastructure — Key Trends from Matt Turck’s MAD Landscape
A summary of the 2023 data infrastructure landscape from the perspective of a practitioner who talks (a lot) to end-users in data communities
Matt Turck has recently published the 2023 MAD (Machine Learning, Artificial Intelligence & Data) Landscape overview. Similar graphics have been made for 2012, 2014, 2016, 2017, 2018, 2019 (Part I and Part II), 2020, and 2021, and now there is a PDF and an interactive version of the 2023 landscape.
This post briefly summarizes only the “small” fraction of the MAD ecosystem that relates to data infrastructure.
Main changes in the 2023 data infrastructure
💀 Hadoop
While some parts of the Hadoop ecosystem are still being used (e.g., Hive), its usage has declined enough that Hadoop is no longer included in the landscape map. Based on recent Big Data is Dead posts, this checks out.
🛥️ Data Lakes got merged into the same category as lakehouses
Those include, among others, the following tools (founding year and total funding in brackets for reference, where applicable):
- Cloudera (2008, $1041M) — an enterprise data hub built on top of Apache Hadoop.
- Databricks (2013, $3497M) — their lakehouse platform is used for data integration and analytics services; Databricks introduced the lakehouse paradigm and are the category leader.
- Dremio (2015, $405M) — a data analytics platform that allows business users to query data from any data source, then accelerate analytical processing for BI tools, ML, and SQL clients.
- Onehouse (2021, $33M) — a cloud-native managed lakehouse service that helps to build data lakes, process data, and own data in open-source formats.
- Azure Data Lake Storage — S3-like object storage service on Azure, commonly referred to as ADLS Gen 2
- Azure HDInsight — Azure’s managed service for the Hadoop ecosystem
- GCP’s Google BigLake — allows you to create BigLake tables on Google Cloud Storage (GCS), Amazon S3, and ADLS Gen 2 over supported open file formats, such as Parquet, ORC, and Avro.
- GCP’s Google Cloud Dataproc — GCP’s managed service for the Hadoop ecosystem
- AWS Lake Formation — makes it easier to manage an S3-based data lake with integration for Glue metadata catalog, Athena query engine, etc.
- AWS’s Amazon EMR — AWS’s managed service for the Hadoop ecosystem (see the pattern between cloud vendors?)
New single category: Data Quality & Observability
The data quality category merged with data observability, indicating a potential for consolidation due to an increasing overlap of their functionality. Tools in this space include Precisely, Talend, Collibra, Manta, Unravel Data, Great Expectations, SodaData, Anomalo, Acceldata, Monte Carlo, Bigeye, Validio, Databand, Lightup, Metaplane, Datafold, Timeseer, Sifflet, Synq.
How does it affect buyers and end-users? This large category is not even complete, because similar metadata platforms sit in separate categories (Data Governance & Catalog, Data Access) even though they solve the same problem as the tools above: collecting metadata and using it for observability, data quality, discovery, collaborative knowledge sharing, and troubleshooting. For an average data team in an SMB, all three categories offer fantastic tools that make a great addition to a stack, but they are often not indispensable. In contrast, the same tools are mission-critical for big enterprises, which need them for governance, stricter controls, and compliance.
As a result, many of those vendors strive to become the default tool for big enterprise clients, and too many of them are chasing the same large customers. Companies that seem to be successful in that sector are Monte Carlo, Acceldata, and Collibra. The rest either have an established customer base they try to upsell, or need to figure out how to persuade SMBs that this problem is important enough to pay for. Metaplane is an interesting outlier because it also seems to cover smaller teams and even provides a free-forever plan for a one-person team.
New database categories
There are new database categories for GPU, vector, and serverless workloads.
Note: Even though it’s not shown in the MAD landscape, there is a renaissance of embedded databases with DuckDB for OLAP, KuzuDB for Graph, SQLite for RDBMS, Chroma for search, and RocksDB for key-value.
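To make "embedded" concrete, here is a minimal sketch using SQLite through Python's standard-library sqlite3 module. The table name, columns, and sample rows are made up for illustration; the point is only that the database runs inside the application process, with no server to deploy or connect to.

```python
import sqlite3

# An embedded database lives inside the application process.
# ":memory:" keeps everything in RAM; pass a file path instead to persist data.
conn = sqlite3.connect(":memory:")

# Hypothetical table and sample rows, for illustration only.
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events (user_id, amount) VALUES (?, ?)",
    [(1, 10.0), (1, 5.5), (2, 7.25)],
)

# Aggregate in-process with plain SQL, no network round trip involved.
totals = conn.execute(
    "SELECT user_id, SUM(amount) FROM events GROUP BY user_id ORDER BY user_id"
).fetchall()
conn.close()

print(totals)
```

The other embedded engines mentioned above follow the same in-process idea, each with its own client API.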
Fully managed data platforms
All-in-one platforms emerged as a category of tools that promise a more holistic, out-of-the-box experience as an alternative to the Modern Data Stack. Those include Mozart Data, Y42, FruitionData, Keboola, Nexla, 5x, Adverity, and Data Virtuality. I’ll cover those in a separate follow-up post.
The growing importance of consulting services
Due to the expanding and increasingly complex ecosystem, “Data & AI Consulting” services became so important that they got their own category.
Trends in the 2023 data infrastructure
Bundling and consolidation — from a buyer’s perspective
Buyers experience budget pressure and more CFO oversight. Instead of picking the best tools, practitioners are incentivized by their management to pick a tightly integrated all-in-one product to better control costs (one vendor and one contract to negotiate). Data professionals are asked to do more with less: no new hires and no resources to experiment with unproven tools.
Bundling and consolidation — from a seller’s perspective
There are too many companies with overlapping feature sets or, even worse, “single-feature” startups that focus on narrow categories — too narrow to stand on their own long-term. This includes reverse ETL, metrics stores, data catalogs, etc. While each of those companies hopes to become a bigger platform, they are not profitable enough. Since their cash runway typically ranges from a few months to three years, they will have to raise their next round or find a new “home” through acquisition to avoid bankruptcy.
On the opposite side of the spectrum, Snowflake and Databricks compete to become the default data & ML platforms. They aggressively keep expanding their offering to cover a wider spectrum of the data infrastructure. They both made several acquisitions (and will likely continue to do so) to grow their market share and range of features. Confluent, the Kafka company, has followed a similar approach by acquiring Immerok, the company behind Flink.
Bundling and consolidation — predictions for consolidation
Here are categories of tools ripe for consolidation:
- ETL and reverse ETL — similar to how Airbyte acquired Grouparoo, Fivetran could acquire one of its reverse ETL partners (Census or Hightouch)
- Data Quality & Observability — the two are converging, yet the data observability category is still so new that interested buyers often struggle to get organizational buy-in for such a purchase, especially during a recessionary period. When push comes to shove, those tools are considered “nice-to-have”; very few users I’ve interacted with are actively using, evaluating, or even asking for similar products or features, despite their obvious importance and usefulness.
- Data Catalogs — there are too many players, and it’s unclear which ones will survive and whether they can stand on their own without being tied to a larger (governance/observability or cloud data warehouse) platform.
The Modern Data Stack (MDS) is under pressure
MDS represents tools often considered to be bleeding edge and even elitist. To properly adopt those best-of-breed tools, engineers need to spend their (expensive) time stitching them together based on their company’s needs. Generally, this provides great flexibility, modularity, and adaptability from an engineering perspective. However, when companies are trying to cut costs, many buyers are inclined to sacrifice some engineering ideals in favor of more integrated products. Fully managed solutions such as Y42 and Keboola have recently gained popularity in this category.
The line between Reverse ETL and CDP becomes blurry
Customer Data Platform (CDP) is a fairly new category of products that aggregate data from multiple sources, perform segmentation and other analytics, and feed this data back to SaaS tools for marketing campaigns. Reverse ETL has frequently been used for the same purpose: after analyzing data in a cloud data warehouse using, e.g., HEX or dbt, you’d feed it back to SaaS tools. Vendors in both categories started to realize that what they do is not enough, and that they need to expand their scope:
- Reverse ETL tools started to become CDP-like, providing direct customer data analytics without having to rely on other tools for that.
- CDPs started to become more reverse-ETL-like, integrating more closely with data warehouses.
Both categories converge in the same direction.
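The reverse ETL flow described above (analyze in the warehouse, then feed the result back to a SaaS tool) can be sketched roughly as follows. This is an illustration under loud assumptions: an in-memory SQLite database stands in for the cloud data warehouse, and the `orders` table, the `build_sync_payloads` helper, the `high_value` segment, and the payload shape are all hypothetical. No real vendor API is called; a real pipeline would send each payload to the destination's own API.

```python
import json
import sqlite3


def build_sync_payloads(conn, min_spend):
    """Select a customer segment in the warehouse and shape it for a SaaS tool.

    The payload layout is hypothetical; real destinations define their own schema.
    """
    rows = conn.execute(
        "SELECT email, SUM(amount) AS total_spend "
        "FROM orders GROUP BY email "
        "HAVING SUM(amount) >= ? ORDER BY email",
        (min_spend,),
    ).fetchall()
    return [
        {"email": email, "traits": {"total_spend": total, "segment": "high_value"}}
        for email, total in rows
    ]


# SQLite stands in for the warehouse; the rows are made-up sample data.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (email TEXT, amount REAL)")
warehouse.executemany(
    "INSERT INTO orders (email, amount) VALUES (?, ?)",
    [("a@example.com", 120.0), ("a@example.com", 30.0), ("b@example.com", 20.0)],
)

payloads = build_sync_payloads(warehouse, min_spend=100.0)
warehouse.close()

# A real sync would POST these to the destination; here they are only printed.
print(json.dumps(payloads, indent=2))
```

Seen this way, the overlap with CDPs is easy to spot: the segmentation query is the "CDP-like" part, and the payload delivery is the "reverse ETL" part.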
Data mesh, products, contracts
In many organizations, the data stack looks like a mini version of the MAD landscape (some logos would be replaced by names of homegrown systems). Data mesh emerged as one approach to deal with organizational complexity.
While data fabric is purely a technical concept (a single framework connecting data sources, regardless of where they’re physically located), data mesh governs both tools and teams. Its core concept is data products, which may be curated data assets, models, or APIs. Each independent, domain-oriented data team owns its data products and is held accountable for managing them, including SLAs and quality, and provides them to data consumers as a self-service offering. Some believe that data contracts can also help establish better boundaries between data producers and consumers to ensure data quality.
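One way to picture a data contract is as a machine-checkable schema that a producer agrees to honor. The sketch below is a deliberately simplified assumption, not a standard or any specific tool's format: `FieldSpec`, `violations`, and `ORDERS_CONTRACT` are hypothetical names, and real contracts usually also cover semantics, SLAs, and ownership.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """One field the producer promises to deliver (hypothetical structure)."""
    name: str
    dtype: type
    nullable: bool = False


def violations(record, contract):
    """Return a list of human-readable contract violations for one record."""
    problems = []
    for spec in contract:
        if spec.name not in record:
            problems.append(f"missing field: {spec.name}")
            continue
        value = record[spec.name]
        if value is None:
            if not spec.nullable:
                problems.append(f"null not allowed: {spec.name}")
        elif not isinstance(value, spec.dtype):
            problems.append(
                f"wrong type for {spec.name}: "
                f"expected {spec.dtype.__name__}, got {type(value).__name__}"
            )
    return problems


# Hypothetical contract for an "orders" data product.
ORDERS_CONTRACT = [
    FieldSpec("order_id", int),
    FieldSpec("email", str),
    FieldSpec("coupon", str, nullable=True),
]

good = {"order_id": 1, "email": "a@example.com", "coupon": None}
bad = {"order_id": "1", "coupon": None}

print(violations(good, ORDERS_CONTRACT))
print(violations(bad, ORDERS_CONTRACT))
```

A producer could run such a check before publishing, so that breaking changes surface on the producer's side rather than in a consumer's dashboard.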
How not to get overwhelmed by this big ecosystem
Try not to adopt tools until there is a clear use case. You can start with the simplest stack using BigQuery, an ingestion tool, and a notebook tool for analytics. This way, you’ll have a serverless cloud data warehouse (BigQuery), a pay-as-you-go data ingestion tool, and a flexible, low-floor, high-ceiling tool for analytics and reporting (HEX).
Next steps
If you enjoy reading more condensed overviews of blog posts and trends in the data industry, subscribe to the “Cut to the chase” newsletter.