Architectural Foundations & Infrastructure - Part 1
The Importance of Architecture in a Data Platform
This is Part 1 of what I hope will be a long series focusing on foundational ideas and skills for building and maintaining Data Platforms.
When you first consider a data platform as a whole, the architecture and infrastructure serve as the foundation upon which everything else is built. For our purposes, we can use the two words, architecture and infrastructure, interchangeably, not precisely, but close enough.
Together, they are the true cornerstone on which you will build the entire house, the canvas on which the data picture is painted.
Without a well-defined architecture (logical) and a strong infrastructure (physical) to support a system, the rest of the platform will struggle to function efficiently, and execution will be hindered.
The biggest challenge in designing a data platform is that no two architectures (logical) look identical. Different business requirements, technological stacks, and industry constraints influence how a data platform is built.
However, common architectural patterns and best practices can guide the development of a resilient and scalable system. Yet, as I’ve learned over the decades working in and building data platforms, it’s more than simply picking the “right tool.”
Take, for example, three made-up companies for this discussion.
Acme Mfg Corp.
Acme FinTech.
Acme AgTech.
The businesses themselves could not be more different: the manufacturing of physical goods, the movement and transactions of money, and the technology side of “Big Agriculture.”
It would be no surprise that, from a technology perspective, the data platforms supporting these different operations would look completely different. But, surprisingly to some, the fundamental concepts underlying the “how” of building each of these data platforms would be the same.
It doesn’t matter whether we are producing widgets on an assembly line or processing satellite imagery of a farmer’s field; both data platforms need, for example, monitoring and observability in place.
Today’s post is sponsored by Estuary.
Define
Back to the task at hand. Before we get ahead of ourselves, we should define the word architecture in the context of data platforms, so we are on the same page.
How about infrastructure?
Ok, now that we have those more educational definitions out of the way, what does that mean when the rubber hits the road?
The architecture of a data platform is where you put ideas on paper to describe and visualize how you want to approach building and bringing together the components that make up the system.
The infrastructure is where we work out the technical details, decide which tools and frameworks to use, and implement the architecture to bring it to life.
Moving into architecture is one of the most challenging transitions for an engineer. They move from a tactical role, such as writing individual data pipelines, to a more strategic role involving platform-wide decision-making. Instead of focusing on a single database or a specific ETL pipeline, platform architects must consider how the entire ecosystem fits together, ensuring that every component is built for performance, cost, and maintainability.
It’s also clear that we’ve moved from a time when building data platforms was solely the job of an architect to one where even Senior Engineers are expected to develop and maintain them, often from scratch.
This can be a hard transition, from the minute details of a daily pipeline to thinking about the entire system on which that pipeline operates.
Core Architectural Principles of a Modern Data Platform
Before diving into infrastructure and implementation details, let’s explore some guiding principles that should shape every decision when architecting a data platform from scratch, or trying to modernize an existing system:
Scalability: The platform should handle increased data volumes and additional workloads without significant performance degradation. It should scale.
Resilience & Fault Tolerance: Systems must be designed with failure in mind. Failures should not cause significant disruptions, and data should not be lost due to single points of failure. Can the data pipelines be rerun with a single click, or without any clicks at all?
Modularity & Flexibility: Components should be loosely coupled so that individual parts of the system can be upgraded or replaced without significant disruptions. We don’t want to over-tighten couplings between Platform components.
Security & Compliance: Data governance, access controls, and compliance with regulations such as GDPR and HIPAA must be incorporated into the design from the beginning. Who can do what, with what?
Observability & Monitoring: Engineers must be able to detect failures, measure performance, and diagnose issues effectively. Insight into system components is key.
Cost Efficiency: Designing the platform with cost efficiency in mind ensures that the business derives maximum value from its infrastructure investments. Don’t overengineer or complicate the architecture.
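To make one of these principles concrete, here is a minimal sketch of the "can the pipeline be rerun without any clicks" question from the resilience item. Everything in it is hypothetical: the in-memory `warehouse` dict stands in for real storage, and `load_partition` and `run_with_retries` are illustrative names, not part of any library. The idea is that each run overwrites one date partition instead of appending to it, so an automatic retry or a manual rerun cannot duplicate data.

```python
import time
from typing import Callable

# Hypothetical stand-in for a warehouse table, keyed by date partition.
warehouse: dict[str, list[dict]] = {}


def load_partition(run_date: str, rows: list[dict]) -> None:
    """Overwrite the partition for run_date so reruns never duplicate rows."""
    warehouse[run_date] = list(rows)


def run_with_retries(task: Callable[[], None], attempts: int = 3, delay_s: float = 0.0) -> None:
    """Run task, retrying on any exception up to `attempts` times in total."""
    for attempt in range(1, attempts + 1):
        try:
            task()
            return
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the failure to the caller
            time.sleep(delay_s)


# Made-up sample rows for illustration only.
rows = [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": 25.5}]

# Running the same load twice still leaves exactly one copy of the data.
run_with_retries(lambda: load_partition("2024-01-01", rows))
run_with_retries(lambda: load_partition("2024-01-01", rows))
print(len(warehouse["2024-01-01"]))
```

A real platform would get the same property from its storage layer (for example, partition overwrite or merge semantics) and from its orchestrator's retry settings; the sketch only shows why the two belong together.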
Also, while discussing the above technical approaches to bringing a data platform to life, we can’t forget the real-world value and intended purpose of the engineering systems we will build. It’s extremely easy for engineers at all levels to get overly excited about technical details, and that is a good thing, but it cannot be done in a vacuum.
There is another significant piece of architecture that is both the bane of most engineers and also the most important. The business requirements. Don’t you just love those words?
It seems this is where most engineers start to have their eyes glaze over. They just want to solve technical problems and stay as far away from the boring business stuff as possible. I have some bad news for you.
If you ignore the basic business requirements related to a data platform, you are building a house of cards on sand, and it will all be for naught: a waste of your time.
Applied examples.
Think about how different the business requirements for our three companies, Acme Mfg Corp, Acme FinTech, and Acme AgTech, are, simply because of the nature of their operations. If we, as technology consultants, were to sit with the C-suite from each of these companies and ask the CEO and CTO what their core principles are, they might say …
Acme Mfg Corp.
We want to deliver high-quality products while reducing costs to our end customers through our wide distribution channels.
Acme FinTech
We want to provide cutting-edge payment solutions with high availability, performance, and zero downtime.
Acme AgTech
We want to provide an integrated solutions platform that delivers insights at scale to help farm operations drive profitability and sustainability.
Without knowing anything else about the operations of these businesses, an astute observer can already sense differences in each business model that will directly impact the data platform and how it should be built and maintained. Words like cost, performance, zero downtime, and scalability have a real-world impact on what a data platform underpinning these businesses would look like.
Business Value and Purpose
Under no circumstances should anyone start the architecture and planning process for either a greenfield (new) project or an upgrade/migration to a data platform without talking to and understanding the business at its core.
When I say business, I mean every single non-engineering or technical group that might have some sort of need or impact from the data and insights being produced. To be very clear, this is every single group or department inside a company.
Just as we saw above in the example of our made-up companies, the simple one-sentence purpose and intent of a business impact the “how” and “what” of a data platform, let alone the deeper, day-to-day operations in these different environments.
Here is a non-exhaustive list of questions you could (and should) ask of your non-engineering counterparts and teams. Their answers will directly impact how you approach the data platform architecture and technical implementation.
What kind of data do you use, or want access to, in order to do your job better?
What data or data insights impact our end customer/user?
What data or data insights help this company provide better results/meet goals?
How fresh do the data and insights need to be, and how often do you need them?
How do you interact with the data today, and how do you want to do that in the future?
What is the budget for the data platform?
How big is the data team, and is it growing or stagnating?
What do you wish we could do with data tomorrow that we can’t today?
The hard part of aligning the business’s needs with engineering’s desires is that you have to read between the lines. The company (besides engineering) isn’t technical and often doesn’t even know what’s possible.
Yet at the same time, the other departments, like Product, Marketing, and the C-Suite, are consumers of the data and insights we produce. Meeting their desires and expectations is key to building a healthy data platform and a Data Team that is seen as an integral part of the business. It’s also not surprising to have many teams and people with competing desires and data needs.
What might this look like in the real world? How could answers to some questions and the desires of different business units inform our data platform architecture?
Do we need batch, real-time, or near-realtime insights?
Does the business trust the data it has?
Does the business see data as an asset or a cost center?
What kind of analytics tools are expected, and what features are needed?
Do you have or need better Machine Learning capabilities?
What is the budget for the data platform?
What new products are coming?
Can we support those new needs on the current platform?
The answers to these real-world questions will have an outsized impact on how you move forward with building a data platform. Do you need a Kappa or Lambda approach?
Will there be heavy use of data visualizations and analytical needs? Does the business need and want to interact with the data, or just consume basic reports? What are the expectations around the freshness of data? Are we going to have to be extremely budget-conscious, or is that not an issue?
All of these decisions have real-world infrastructure-related impacts; it might be the difference between a more costly SaaS version of some software, vs a cheaper but more time-intensive self-hosted solution.
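The SaaS versus self-hosted trade-off is easier to reason about with rough arithmetic. The sketch below is only an illustration: `HostingOption` is a hypothetical helper, and every dollar and hour figure is an invented placeholder, not a real quote. It adds the cost of engineering time to the sticker price, which is the piece that a "cheaper" self-hosted option tends to hide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostingOption:
    name: str
    annual_license_usd: float     # subscription or license fees
    annual_infra_usd: float       # servers, storage, networking paid for directly
    annual_engineer_hours: float  # time spent installing, upgrading, and on call

    def total_annual_cost(self, hourly_rate_usd: float) -> float:
        """Sticker price plus the cost of the engineering time it consumes."""
        return (
            self.annual_license_usd
            + self.annual_infra_usd
            + self.annual_engineer_hours * hourly_rate_usd
        )


# All figures are invented placeholders; substitute your own quotes and estimates.
saas = HostingOption("SaaS", 60_000, 0, 100)
self_hosted = HostingOption("Self-hosted", 0, 20_000, 800)

# Compare the two options at two assumed costs of an engineering hour.
for rate in (25.0, 90.0):
    cheaper = min((saas, self_hosted), key=lambda o: o.total_annual_cost(rate))
    print(f"at ${rate:.0f}/hour the cheaper option is {cheaper.name}")
```

With these made-up inputs, the answer flips depending on how engineering time is valued, which is why team size and budget belong in the conversation with the business.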
What if we take just one of these business questions and apply it to our three companies?
Do we need batch, real-time, or near-realtime insights?
Acme Mfg Corp.
A medium-sized, very cost-conscious widget factory that sends large orders to various distributors; batch processing of data and insights is more than enough to meet business needs.
Acme FinTech
A high-tech company that deals with financial transactions; the performance and criticality of payment systems require streaming insights to meet business needs.
Acme AgTech
Medium-sized tech company working in the Big Ag space, dealing with scale; they have an interesting mix of needs, a large customer base, and huge datasets. Near-realtime will provide sufficient insights.
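The three answers above can be sketched as a simple mapping from a freshness requirement to a processing style. This is a minimal illustration, not a rule: `FreshnessRequirement` and `processing_mode` are hypothetical names, the thresholds are assumptions rather than industry standards, and the staleness budgets are made up to match the three descriptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FreshnessRequirement:
    company: str
    max_staleness_seconds: int  # how old an insight may be before it loses value


def processing_mode(req: FreshnessRequirement) -> str:
    """Map a freshness requirement to a processing style.

    The thresholds are illustrative assumptions, not industry standards.
    """
    if req.max_staleness_seconds <= 5:
        return "streaming"
    if req.max_staleness_seconds <= 15 * 60:
        return "near-realtime"
    return "batch"


# Made-up staleness budgets chosen to match the three company descriptions.
requirements = [
    FreshnessRequirement("Acme Mfg Corp", 24 * 60 * 60),  # daily is fine
    FreshnessRequirement("Acme FinTech", 1),              # seconds matter
    FreshnessRequirement("Acme AgTech", 5 * 60),          # minutes are enough
]

for req in requirements:
    print(f"{req.company}: {processing_mode(req)}")
```

The useful part is not the function itself but the input: a number like "how stale can this be" is something you can only get by asking the business.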
Only by truly understanding how a business operates can we find the answers to how data is produced, processed, and consumed to provide value.
It can be easy to gravitate towards what we are most comfortable with and ignore those topics we deem less necessary. This can be a very costly mistake.
Once a data platform is up and running, it’s much more challenging to make significant changes without disruption, like changing a flat tire while the car is still moving.
In the upcoming parts of this series, we will dive a little deeper into each of the items listed above, all of which are essential to consider when understanding or designing the architecture and infrastructure of a data platform.
In Summary
Tools and frameworks change, and what is popular changes over time. Data platforms should be built and designed around a set of concepts and principles that rarely change.
This is the danger when we discuss technical topics like building data platforms. With the fast pace of technology changes, it’s essential to talk about timeless truths that transcend any single technology and the concepts closely tied to it.
Architectural decisions define the success or failure of a data platform. While individual tools and technologies evolve, the foundational principles of scalability, modularity, observability, and cost-efficiency remain constant.
Good architecture is about building around what the business actually needs.