Chapter 1 – The Data Locality Spectrum
How the Distance Between Your Code and Your Data Defines Everything
There’s a number that haunts every distributed system architect. It’s not a security vulnerability score or a cost multiplier. It’s simpler, more fundamental, and completely immutable: about 57 milliseconds.
That’s roughly how long light takes, in a vacuum, to travel from San Francisco to London and back: the absolute physical minimum for a round-trip network request between those cities[1]. In optical fiber, where signals propagate at roughly two-thirds of that speed, the floor climbs to about 85 milliseconds. Your database query can’t be faster than physics. No amount of optimization, no clever caching strategy, no revolutionary new protocol can change the speed of light. And yet, we’ve spent the last two decades building systems that pretend this constraint doesn’t exist.
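If you want to check that floor yourself, it falls out of nothing more than a great-circle distance and a propagation speed. Here’s a small Python sketch; the city coordinates are approximate and the fiber figure assumes a refractive index of about 1.47, so treat the output as a physical lower bound rather than a measurement:

import math

C_VACUUM_KM_S = 299_792              # speed of light in a vacuum, km/s
C_FIBER_KM_S = C_VACUUM_KM_S / 1.47  # rough speed of light in silica fiber

def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine distance between two points on Earth, in kilometers."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return radius_km * 2 * math.asin(math.sqrt(a))

# Approximate coordinates for San Francisco and London.
distance_km = great_circle_km(37.77, -122.42, 51.51, -0.13)
for medium, speed in [("vacuum", C_VACUUM_KM_S), ("fiber", C_FIBER_KM_S)]:
    rtt_ms = 2 * distance_km / speed * 1000
    print(f"round-trip floor ({medium}): {rtt_ms:.0f} ms")  # ~57 ms vacuum, ~85 ms fiber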
This is the central tension in modern distributed systems: the growing distance between where computation happens and where data lives. Understanding this distance—what I call the data-locality spectrum—is the key to understanding why your system is slow, expensive, or fragile.
The Evolution of Distance
Let’s rewind twenty years. You’re running a monolithic application. The database lives on the same physical machine as your application server, or at worst, on another machine in the same rack. Query latency? Sub-millisecond. Network failures? Irrelevant. Data consistency? Trivial—there’s only one copy. The computation and the data are co-located.
This architecture had problems—scaling was vertical, failures were catastrophic, and deployment was a nightmare—but it had one transcendent virtue: data was local. The physical distance between your SELECT statement and the rows it retrieved could be measured in centimeters.
Then we discovered microservices. We decomposed our monoliths into dozens, then hundreds, of independent services. We moved to the cloud. We deployed across multiple availability zones for resilience. We replicated to multiple regions for performance. Each step made our systems more scalable, more resilient, more flexible.
Each step also increased the distance between computation and data.
Defining the Spectrum
The data-locality spectrum represents the physical and logical distance between where your code executes and where your data persists. This distance manifests in two dimensions:
Physical distance: The actual geographic separation, measured in kilometers and ultimately bounded by the speed of light. A query to a database in the same process is different from a query to a database in the same datacenter, which is different from a query to a database on another continent.
Logical distance: The number of network boundaries, consistency protocols, and coordination steps between computation and storage. A read from an embedded SQLite database requires no network I/O and no coordination. A read from a globally-distributed Spanner database might involve quorum protocols across three continents.
The spectrum runs from one extreme to the other:
Application-Local
Data lives inside the application process or on locally-attached storage. Every query is a local operation. Latency is measured in microseconds. Network failures are someone else’s problem. This is the architecture HarperDB pioneered with its composable application platform—the database is literally part of your application runtime[2].
Application Process
├── Application Logic
└── Database Engine (embedded)
└── Local Disk
Query latency: 1-10 microseconds
Network hops: 0
Failure domains: 1
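To make “zero network hops” concrete, here’s a minimal sketch using Python’s built-in sqlite3 module as a stand-in for any embedded engine (it is deliberately generic, not HarperDB’s API): every read and write below is a function call inside the same process.

import sqlite3

# The entire database lives inside this process, backed by a local file.
conn = sqlite3.connect("sessions.db")
conn.execute("CREATE TABLE IF NOT EXISTS sessions (id TEXT PRIMARY KEY, user TEXT)")

# A write: no network hop, just a call into the embedded engine and the local disk.
conn.execute("INSERT OR REPLACE INTO sessions VALUES (?, ?)", ("abc123", "alice"))
conn.commit()

# A read: a function call, not a round trip.
row = conn.execute("SELECT user FROM sessions WHERE id = ?", ("abc123",)).fetchone()
print(row)  # ('alice',)
conn.close()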
Regional Clusters
Data lives in a cluster of machines within a single datacenter or availability zone. Queries involve network round trips but within a controlled, high-bandwidth environment. This is your typical PostgreSQL primary-replica setup or a single-region Cassandra cluster[3].
Application Servers (AZ-1)
↓ 1-2ms
Database Cluster (AZ-1)
├── Primary Node
└── Replica Nodes
Query latency: 1-5 milliseconds
Network hops: 1-2
Failure domains: 2-3
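The extra cost of this tier comes from waiting for replicas to acknowledge a synchronous write. The toy sketch below models only that shape; threads and random sleeps stand in for real replicas and real network calls, and the replica names are invented for the example:

import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

REPLICAS = ["replica-1", "replica-2"]  # hypothetical in-AZ standbys
QUORUM = 1                             # commit once at least one standby acknowledges

def replicate(replica: str, payload: dict) -> str:
    # Stand-in for a network call; same-AZ round trips are typically 1-2 ms.
    time.sleep(random.uniform(0.001, 0.002))
    return replica

def write(payload: dict) -> float:
    """Send the write to every standby and return the time (ms) to reach a quorum of acks."""
    start = time.perf_counter()
    latency_ms = 0.0
    with ThreadPoolExecutor(max_workers=len(REPLICAS)) as pool:
        futures = [pool.submit(replicate, r, payload) for r in REPLICAS]
        acks = 0
        for future in as_completed(futures):
            acks += 1
            if acks >= QUORUM:
                # Commit point: enough replicas have acknowledged the write.
                latency_ms = (time.perf_counter() - start) * 1000
                break
    return latency_ms

latency = write({"id": "abc123"})
print(f"commit latency: {latency:.1f} ms")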
Multi-Region, Eventually Consistent
Data is replicated across geographic regions. Writes go to the nearest region; reads can be served locally but might be stale. This is DynamoDB Global Tables, Cassandra with multi-DC replication, or MongoDB with geographically distributed replica sets[4].
Application (US-West)            Application (EU-West)
      ↓ 1-2ms                          ↓ 1-2ms
Database (US-West)  ←---80ms---→  Database (EU-West)
Local query latency: 1-5 milliseconds
Cross-region write latency: 50-150 milliseconds
Consistency: Eventual
Failure domains: N regions
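“Eventually consistent” here means each region accepts writes locally and the replicas converge later, so you have to choose a merge rule for conflicting updates. A common (and lossy) choice is last-write-wins by timestamp; the sketch below is a deliberately simplified illustration of that idea, not any particular database’s conflict-resolution algorithm:

from dataclasses import dataclass

@dataclass
class Version:
    value: str
    timestamp_ms: int  # wall-clock time of the write; clock skew is a real hazard here
    region: str

def last_write_wins(a: Version, b: Version) -> Version:
    """Keep the most recent write; break ties by region name so the result is deterministic."""
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))

us = Version("shipped", timestamp_ms=1_700_000_000_500, region="us-west")
eu = Version("cancelled", timestamp_ms=1_700_000_000_450, region="eu-west")
print(last_write_wins(us, eu).value)  # "shipped" -- the EU write is silently discarded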
Multi-Region, Strongly Consistent (Spanner-style)
Data is replicated across regions, but all writes require coordination across multiple regions to maintain strong consistency. Every write must achieve quorum across geographically distributed nodes. This is Google Spanner, CockroachDB, or YugabyteDB in their strictest consistency modes[5].
Application (US-West)
↓
Consensus Protocol
├── Node (US-West) ---80ms--- Node (EU-West)
├── Node (US-East) ---60ms--- Node (EU-West)
└── Node (EU-West)
Write latency: 100-300 milliseconds
Read latency: 1-5 milliseconds (nearest replica) or 50-150 milliseconds (linearizable)
Consistency: Strict serializability
Coordination overhead: Paxos/Raft across all writes
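You can put a floor under those write numbers with nothing more than a table of inter-region round-trip times: with a majority quorum, a write coordinated from one region can commit no sooner than the round trip to the replica that completes the majority. The sketch below uses made-up RTTs and ignores extra protocol phases, leader placement, and queueing, so it is a lower bound, not a prediction:

# Illustrative round-trip times (ms) from a coordinator in us-west to each replica.
RTT_MS = {"us-west": 2, "eu-west": 140, "ap-southeast": 230}

def quorum_commit_floor_ms(rtt_ms: dict[str, float]) -> float:
    """Lower bound on commit latency: the RTT to the replica that completes a majority quorum."""
    quorum = len(rtt_ms) // 2 + 1  # majority of replicas
    return sorted(rtt_ms.values())[quorum - 1]

print(quorum_commit_floor_ms(RTT_MS))  # 140 -- us-west and eu-west form a majority before ap-southeast replies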
The Non-Linear Scaling of Distance
Here’s where it gets interesting: the costs of distance don’t scale linearly. Double the physical distance, and you don’t just double the latency—you multiply the coordination complexity, amplify the failure surface, and compound the consistency challenges.
Consider a simple write operation:
Application-local: The write hits the local storage engine. If you’re using an embedded database like SQLite or HarperDB’s in-process engine, the write might involve an fsync to disk. Cost: tens of microseconds to a few milliseconds on modern SSDs, depending on whether the drive has power-loss protection. No network involved.
Regional cluster with 3 replicas: The write goes to the primary, which must replicate to the other two replicas. If you’re using synchronous replication, you wait for acknowledgment from a quorum. Cost: 5-10 milliseconds of replication latency, plus exposure to network failures between nodes.
Multi-region with 9 replicas (3 per region): The write must coordinate across three geographic regions. Even with eventual consistency, you’re paying for cross-region bandwidth and dealing with the possibility that one of those regions is temporarily unreachable. Cost: 50-150 milliseconds, plus the complexity of conflict resolution.
Multi-region with strong consistency: The write cannot complete until a quorum of geographically distributed nodes agrees. You’re paying the physics tax on every single write. Cost: 100-300 milliseconds for the coordination protocol alone.
This isn’t just about latency—it’s about the compound probability of failure. Every network hop introduces a new failure mode. Every consistency protocol introduces new edge cases. Every geographic region introduces new regulatory considerations.
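The compounding is easy to quantify: if each dependency on a request path is independently available with probability p, the request succeeds only with the product of those probabilities, so availability erodes multiplicatively with every hop you add. A back-of-the-envelope example with made-up per-dependency numbers:

from math import prod

# Hypothetical availability of each dependency on one request path.
availability = {
    "app to regional db": 0.9999,
    "cross-region link":  0.999,
    "remote region db":   0.9999,
    "consensus quorum":   0.9995,
}

combined = prod(availability.values())
print(f"combined availability: {combined:.4%}")                               # ~99.83%
print(f"expected downtime per year: {(1 - combined) * 365 * 24:.1f} hours")   # ~14.9 hours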
The Central Question
Which brings us to the question that will drive this entire series: Is there an equilibrium between speed and reach?
The application-local approach gives you incredible performance but limited scale. You can process millions of requests per second—as long as they all hit the same node and fit in local storage. The moment you need to shard, you’ve lost the purity of the model. Now you have network calls, distributed queries, and the coordination overhead you were trying to avoid.
The global distributed approach gives you unlimited scale and geographic reach. You can serve users in Tokyo and London from their nearest datacenter. But you pay the physics tax on every operation. Your P99 latencies are measured in hundreds of milliseconds. Your error handling code dwarfs your business logic.
Neither extreme is the answer for most systems. Yet we keep building systems at the extremes because the middle ground is harder to reason about. It requires admitting that different data has different locality requirements. Your user session? That should be local. Your global inventory count? That probably needs to be distributed. Your audit log? That can be eventually consistent across regions.
The real challenge isn’t choosing between local and distributed—it’s building systems that can span the entire spectrum intelligently, placing each piece of data at the point on that spectrum where the trade-offs make sense for its access patterns, durability requirements, and consistency needs.
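One way to make that concrete is to treat placement as declarative policy: tag each class of data with the point on the spectrum where it should live, and let the platform do the routing. The sketch below is purely illustrative; the tier names, the placement table, and the route helper are inventions for this example, not any product’s API:

from enum import Enum

class Tier(Enum):
    APP_LOCAL = "application-local"            # microseconds, single failure domain
    REGIONAL = "regional-cluster"              # low milliseconds, one region
    MULTI_REGION_EC = "multi-region-eventual"  # local reads, asynchronous convergence
    MULTI_REGION_SC = "multi-region-strong"    # geo-quorum on every write

# Hypothetical placement policy for the examples above.
PLACEMENT = {
    "user_session":    Tier.APP_LOCAL,        # hot, per-user, cheap to rebuild
    "inventory_count": Tier.MULTI_REGION_SC,  # must not oversell
    "audit_log":       Tier.MULTI_REGION_EC,  # append-only, staleness is acceptable
}

def route(entity: str) -> Tier:
    """Pick the storage tier for an entity, defaulting to the strongest (and slowest) option."""
    return PLACEMENT.get(entity, Tier.MULTI_REGION_SC)

print(route("user_session").value)  # application-local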
Over the next several chapters, we’ll explore this spectrum in detail. We’ll examine the physics that constrains us, the patterns that work at each point on the spectrum, and the emerging approaches that might let us have our cake and eat it too—systems that are both fast and globally available, strongly consistent where it matters and eventually consistent where it doesn’t, simple to operate but powerful enough for the most demanding workloads.
We’ll look at what happens when you try to operate at each extreme. We’ll quantify the trade-offs. And we’ll explore whether there’s a path toward systems that automatically optimize data placement across the entire spectrum—an intelligent data plane that puts the right data in the right place at the right time.
Because ultimately, the system that figures out how to strike this balance with the fewest moving parts for the largest number of applications will become the foundation of distributed data infrastructure for the next decade.
The speed of light isn’t changing. But perhaps our relationship with it can.
References
[1] C. Bauer, “Network Latency Considerations in Distributed Systems,” ACM Computing Surveys, vol. 52, no. 3, pp. 1-35, 2019.
[2] HarperDB, “HarperDB Technical Architecture,” Technical Documentation, 2023. [Online]. Available: https://docs.harperdb.io/
[3] A. Lakshman and P. Malik, “Cassandra: A Decentralized Structured Storage System,” ACM SIGOPS Operating Systems Review, vol. 44, no. 2, pp. 35-40, 2010.
[4] G. DeCandia et al., “Dynamo: Amazon’s Highly Available Key-value Store,” Proc. 21st ACM Symposium on Operating Systems Principles, pp. 205-220, 2007.
[5] J. C. Corbett et al., “Spanner: Google’s Globally-Distributed Database,” Proc. 10th USENIX Symposium on Operating Systems Design and Implementation, pp. 261-264, 2012.
Next in this series: Chapter 2 – The Physics of Distance, where we’ll quantify exactly what the speed of light costs us and why perfect software still can’t beat geography.

