Chapter 5 – Write Amplification and the Cost of Locality
Why Perfect Replication Collapses Under Its Own Weight
Here’s a seductive idea: put a copy of your data on every node. Every query becomes a local operation. Network failures? Irrelevant. Cross-region latency? Eliminated. Your database becomes a pure in-memory lookup—microseconds to serve any query.
This is the ultimate expression of data locality: complete replication. Every node has everything. Perfect availability, zero coordination for reads, guaranteed local performance.
There’s just one problem: writes.
In this chapter, we’re going to quantify exactly why “replicate everything everywhere” is a fantasy for most systems. We’ll do the math on storage throughput, calculate the bandwidth costs, model the disk wear, and examine the strategies systems use to mitigate write amplification. By the end, you’ll understand why perfect locality often collapses under write load—and what you can do about it.
The Fundamental Problem: 1 Write → N Writes
Let’s start with the simplest possible scenario: a cluster of 10 nodes, fully replicated.
You write a record. That record is 1KB. How much physical I/O happens?
Naive replication: Each of the 10 nodes must write 1KB to disk. Total: 10KB of physical writes.
Your application sees “1 write” but the storage layer sees “10 writes.” This is 10× write amplification.
Now scale to 100 nodes. Same 1KB record. Total: 100KB of physical writes. Your write amplification factor is 100×.
“So what?” you might ask. “Storage is cheap. SSDs are fast. Why does this matter?”
Let’s do the math.
Storage Throughput Limits
Modern datacenter SSDs (like the Samsung PM9A3 or Intel P5800X) can sustain approximately:
100,000 IOPS (I/O operations per second) for random writes
3,000 MB/s sequential write throughput[1]
For simplicity, let’s assume 1KB writes and focus on IOPS. Your SSD can handle 100,000 writes/second.
Single node (no replication):
Application writes: 100,000/second
Physical writes: 100,000/second
SSD utilization: 100%
10-node cluster (full replication):
Application writes: 100,000/second (the ceiling does not move)
Physical writes per node: 100,000/second
Total physical writes across the cluster: 1,000,000/second
SSD utilization: 100% on every node
You added nine nodes and gained nothing for writes. Every node must apply every write, so the cluster still tops out at a single drive's 100,000 application writes/second, while performing ten times the physical I/O to stay in sync.
100-node cluster (full replication):
Application writes: 100,000/second (still the ceiling)
Physical writes per node: 100,000/second
Total physical writes across the cluster: 10,000,000/second
SSD utilization: 100% on every node
One hundred nodes, each with a high-end SSD, and you still cannot exceed 100,000 writes/second at the application level. What you have bought is 100× the physical I/O, wear, power, and replication traffic for the same workload.
This is the write amplification trap: adding nodes increases your capacity for reads (more nodes can serve more read traffic) but does nothing for your capacity for writes, and once replication traffic starts saturating the network, it actively decreases it.
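To make the arithmetic concrete, here is a minimal Python sketch of the model above. The numbers (100k write IOPS per SSD, 1KB records, every node applying every write) are the same round, illustrative assumptions used in the text, not benchmarks of any particular drive or database.

```python
# Back-of-the-envelope model of full replication, using the round numbers above.
NODE_WRITE_IOPS = 100_000  # assumed physical writes/second one SSD can sustain

def full_replication_profile(nodes: int, app_writes_per_sec: int) -> dict:
    """Every node must apply every application write."""
    per_node_physical = app_writes_per_sec           # the amplification is paid on every node
    total_physical = app_writes_per_sec * nodes      # cluster-wide physical I/O
    return {
        "write_amplification": nodes,
        "per_node_physical_writes": per_node_physical,
        "per_node_ssd_utilization": per_node_physical / NODE_WRITE_IOPS,
        "total_physical_writes": total_physical,
        "app_write_ceiling": NODE_WRITE_IOPS,        # ceiling never moves, regardless of node count
    }

for n in (1, 10, 100):
    print(n, "nodes:", full_replication_profile(n, app_writes_per_sec=100_000))
```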
Storage Endurance: The Hidden Cost
SSDs don’t last forever. They have a finite number of write cycles before cells wear out. This is measured in Drive Writes Per Day (DWPD) or Total Bytes Written (TBW)[2].
A typical datacenter SSD:
1TB capacity
3 DWPD rating (can write 3TB/day for 5 years)
Total endurance: 3TB/day × 365 days × 5 years = 5,475 TB lifetime writes
Under normal usage (1 DWPD actual load), your drive lasts 15 years. Great.
Now add write amplification.
10-node cluster with full replication:
Write volume sized for 1 DWPD per drive (each drive's fair share, had the data been split across the fleet)
Physical writes per node: 10 DWPD, because every drive must absorb the entire write stream rather than its share
Expected lifetime: 5,475 TB / (10 TB/day) = 547 days (about 1.5 years)
100-node cluster with full replication:
Write volume sized for 1 DWPD per drive
Physical writes per node: 100 DWPD
Expected lifetime: 5,475 TB / (100 TB/day) = 55 days (under 2 months)
Your drives are burning out 10-100× faster than planned. You're replacing SSDs constantly, and your operational costs climb accordingly.
This is not theoretical. I’ve seen production HarperDB clusters with aggressive replication require SSD replacement every 6-9 months instead of the expected 5+ years. The write amplification was literally destroying hardware.
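Here is the endurance arithmetic as a short Python sketch, using the 1TB / 3 DWPD / 5-year figures from above. Treat the helper and its inputs as illustrative; a real drive publishes a TBW rating you should plug in instead.

```python
# Rough SSD endurance math for a 1TB drive rated at 3 DWPD over a 5-year warranty.
DRIVE_TB = 1
RATED_DWPD = 3
WARRANTY_YEARS = 5
LIFETIME_TB = DRIVE_TB * RATED_DWPD * 365 * WARRANTY_YEARS  # 5,475 TB of rated writes

def drive_lifetime_days(app_dwpd_share: float, replication_factor: int) -> float:
    """Days until the rated write budget is exhausted.

    app_dwpd_share: the per-drive load the fleet was sized for (its fair share).
    replication_factor: how many times that share each drive actually absorbs.
    """
    physical_tb_per_day = app_dwpd_share * DRIVE_TB * replication_factor
    return LIFETIME_TB / physical_tb_per_day

for rf in (1, 10, 100):
    print(f"replication {rf:>3}x -> {drive_lifetime_days(1.0, rf):,.0f} days")
# 1x -> 5,475 days (~15 years); 10x -> ~548 days; 100x -> ~55 days
```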
Bandwidth: The Economic Dimension
Storage wear is an operational problem. Bandwidth is a financial problem.
Every write must be transmitted to N-1 other nodes. Let's model a 100-node, fully replicated cluster handling 100,000 application writes/second of 1KB records (the write ceiling we just established, with every SSD already running flat out) and see what the network does. Each node ingests an even 1,000 writes/second and must forward each one to its 99 peers.
Per-node replication traffic:
Outbound: 99 replicas × 1,000 writes/sec × 1KB ≈ 99 MB/second ≈ 8.6 TB/day
Inbound: 99 sources × 1,000 writes/sec × 1KB ≈ 99 MB/second ≈ 8.6 TB/day
Total per node: ≈ 17 TB/day
Cluster-wide replication traffic (counting each transfer once, i.e., summing outbound):
100 nodes × 8.6 TB/day ≈ 855 TB/day
Now let’s price this. Assuming nodes are distributed across 3 regions:
Cross-region bandwidth costs (AWS pricing):
$0.02/GB same-region
$0.05/GB cross-region
If 2/3 of your replication traffic crosses regions:
855 TB/day × 0.67 (cross-region ratio) × 1,000 GB/TB × $0.05/GB ≈ $28,600/day
Monthly bandwidth cost: roughly $860,000
That's just the bandwidth. Add the compute costs for serialization, deserialization, and coordination, and you're spending on the order of a million dollars a month, for a workload of 100,000 writes/second that a single well-provisioned node could very nearly absorb on its own.
For comparison, a well-architected sharded system handling the same load might cost $50k-100k/month in total.
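Here is the same bandwidth model as a small Python sketch. The workload, cross-region ratio, and per-GB rate are the rough assumptions quoted in the text, not an actual cloud bill; adjust them for your own regions and providers.

```python
# Sketch of the replication-bandwidth cost model above.
def full_replication_bandwidth_cost(
    nodes: int = 100,
    app_writes_per_sec: int = 100_000,     # cluster-wide application writes
    record_kb: float = 1.0,
    cross_region_fraction: float = 0.67,   # assumed share of traffic that crosses regions
    usd_per_gb_cross_region: float = 0.05, # rough per-GB rate from the text
) -> dict:
    copies_sent_per_write = nodes - 1                     # full replication: one copy per peer
    gb_per_day = (app_writes_per_sec * copies_sent_per_write
                  * record_kb * 1_000 * 86_400) / 1e9
    daily_cost = gb_per_day * cross_region_fraction * usd_per_gb_cross_region
    return {
        "replication_TB_per_day": gb_per_day / 1_000,
        "daily_cross_region_cost_usd": round(daily_cost),
        "monthly_cost_usd": round(daily_cost * 30),
    }

print(full_replication_bandwidth_cost())
# ~855 TB/day of replication traffic, roughly $29k/day, ~$860k/month
```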
Replication Strategies and Their Trade-offs
“Clearly full replication doesn’t scale,” you’re thinking. “So what are the alternatives?”
Let’s examine the spectrum of replication strategies and their costs.
Strategy 1: Asynchronous Replication
Approach: Write to local node immediately, replicate to other nodes in the background.
Write amplification: Still N× (must eventually write to all N nodes)
Latency impact: Minimal—application sees fast local write (1-10ms)
Consistency: Eventual. Recent writes might not be visible on all nodes yet.
Failure mode: If the node crashes before replicating, writes are lost.
Example: Cassandra with consistency level ONE, MongoDB with w:1 write concern[3][4].
When it works: High write throughput requirements where you can tolerate lost writes (logs, metrics, clickstream data).
When it fails: Financial transactions, inventory management, anything where data loss is unacceptable.
Strategy 2: Synchronous Quorum Replication
Approach: Write to N nodes synchronously, but only wait for a quorum (majority) to acknowledge.
Write amplification: Still N× physical writes, but latency is set by the Q-th fastest acknowledgment (where Q > N/2)
Latency impact: Higher than async but not as bad as waiting for all N nodes. For a 5-node cluster with a quorum of 3, you wait for the 3rd-fastest node.
Consistency: Strong. Any quorum read will see any quorum write.
Failure mode: Can tolerate N-Q failures and remain available.
Example: Cassandra with QUORUM, CockroachDB, etcd, Consul[3][5][6].
Cost model (5-node cluster, 10k writes/sec):
Physical writes: 50k writes/second across the cluster (10k/second on each node)
SSD utilization: 10% per node (10k of a 100k IOPS budget)
Bandwidth: each write crosses the network 4 times (5 total copies)
This is better than full synchronous replication but still 5× the storage and bandwidth cost of a single node.
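To see why waiting for a quorum is cheaper than waiting for every replica, here is a toy Python simulation. The per-replica latency distribution is invented purely for illustration; the point is only that the Q-th fastest acknowledgment arrives much sooner than the slowest one.

```python
# Toy model of quorum write latency: the client unblocks on the Q-th fastest ack out of N.
import random

def avg_write_latency_ms(n: int, quorum: int, trials: int = 50_000) -> float:
    total = 0.0
    for _ in range(trials):
        acks = sorted(random.lognormvariate(1.0, 0.6) for _ in range(n))  # made-up ack times (ms)
        total += acks[quorum - 1]       # the Q-th fastest ack completes the write
    return total / trials

for q in (1, 3, 5):
    print(f"N=5, wait for {q} ack(s) -> avg {avg_write_latency_ms(5, q):.1f} ms")
```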
Strategy 3: Leader-Follower with Selective Replication
Approach: Designate a leader per partition. Leader handles writes, replicates to a subset of followers.
Write amplification: 2-3× (leader + 1-2 replicas)
Latency impact: Moderate (one cross-region hop if replicas are distant)
Consistency: Strong within replica set, but scope is limited to partition.
Failure mode: If leader fails, elect new leader from followers (10-30 second failover)
Example: PostgreSQL with streaming replication, MySQL with primary-replica[7][8].
Cost model (3-replica setup, 10k writes/sec):
Physical writes: 30k writes/second across cluster
SSD utilization: 10% per node (10k of a 100k IOPS budget)
Bandwidth: each write crosses the network twice (3 total copies)
This is the sweet spot for many applications: strong consistency, manageable overhead, proven at scale.
Strategy 4: Append-Only Logs with Compaction
Approach: Instead of replicating individual writes, replicate an append-only log of all operations. Periodically compact the log by removing superseded entries.
Write amplification: Every write produces a log entry on each replica; compaction later reduces long-term storage, at the cost of some extra rewrite I/O while compacting.
Latency impact: Low for writes (append to log), higher for reads (must replay log or query compacted state)
Consistency: Eventually consistent (log replay takes time)
Failure mode: Log corruption or loss can affect all downstream replicas
Example: Apache Kafka, AWS DynamoDB Streams, Cassandra’s commit log[9][10].
Cost model:
Write amplification: 2-5× before compaction, 1-2× after
Compaction overhead: 10-30% of CPU cycles
Storage: Depends on compaction frequency and retention policy
The log approach decouples write durability from replication, allowing async propagation while maintaining durability.
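A minimal Python sketch of the idea: every write appends an entry, and compaction keeps only the latest entry per key. Real log-structured systems (Kafka, LSM-based engines) are far more involved; this only illustrates why superseded entries can be dropped.

```python
# Append-only log with last-write-wins compaction per key.
from typing import Any

log: list[tuple[str, Any]] = []

def write(key: str, value: Any) -> None:
    log.append((key, value))            # replicate the log entry, not the whole table

def compact(entries: list[tuple[str, Any]]) -> list[tuple[str, Any]]:
    latest: dict[str, Any] = {}
    for key, value in entries:          # later entries supersede earlier ones
        latest[key] = value
    return list(latest.items())

write("cart:42", {"items": 1})
write("cart:42", {"items": 2})
write("cart:7", {"items": 5})
print(len(log), "log entries ->", len(compact(log)), "entries after compaction")  # 3 -> 2
```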
Strategy 5: CRDTs (Conflict-Free Replicated Data Types)
Approach: Use data structures with mathematical properties that guarantee eventual consistency without coordination.
Write amplification: N× (all nodes eventually get all writes)
Latency impact: Minimal—writes are local, reconciliation is background
Consistency: Strong eventual consistency (all replicas converge to same state)
Failure mode: Requires careful data structure design; not all operations can be expressed as CRDTs
Example: Riak’s data types, Redis CRDTs, Automerge[11][12].
Constraints: Only works for specific operations:
Counters (increment/decrement)
Sets (add/remove)
Registers (last-write-wins)
Graphs (add node/edge)
Cannot express arbitrary transactions or enforce constraints globally (e.g., “ensure balance never goes negative”).
Cost model:
Write amplification: N× but async
Metadata overhead: 50-200% (vector clocks, version vectors)
Computation overhead: 10-40% (merge algorithms)
CRDTs are elegant for specific use cases but not a general-purpose solution.
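For a feel of how CRDTs converge without coordination, here is a minimal grow-only counter (G-Counter) in Python. It is a teaching sketch of one structure from the list above, not Riak's or Redis's implementation.

```python
# G-Counter CRDT: each node increments only its own slot; merge takes the element-wise
# maximum, so replicas converge regardless of delivery order or duplication.
class GCounter:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts: dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())

a, b = GCounter("node-a"), GCounter("node-b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)                  # replication in any order, any number of times
assert a.value() == b.value() == 5
```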
Performance Curves: Where Systems Break Down
Let’s model how different replication strategies perform as cluster size grows.
Setup:
Each node can handle 100k writes/second
1KB records
Target: maintain 50k application writes/second
Single-node (no replication):
Physical writes: 50k/second
SSD utilization: 50%
Result: Works fine
3-node quorum replication:
Physical writes per node: 50k/second (all writes go to all nodes)
SSD utilization: 50%
Result: Works fine, with redundancy
10-node full replication:
Physical writes per node: 50k/second
SSD utilization: 50%
Result: Works, but the cluster now performs 500k physical writes/second (ten copies of every record) to serve 50k
100-node full replication:
Target app writes: 50k/second
Required physical writes per node: 50k/second
SSD utilization: 50%
Result: Still works from a pure IOPS standpoint, but you are paying for 5 million physical writes/second, and the endurance and bandwidth costs from earlier sections now dominate
1000-node full replication:
Target app writes: 50k/second
Required physical writes per node: 50k/second
SSD utilization: 50%
But wait: network bandwidth becomes the bottleneck
Delivering every write to 999 other nodes means the cluster moves 999 × 50k writes/sec × 1KB ≈ 50 GB/second of replication traffic (over 4 PB/day)
A typical datacenter NIC is 10-25 Gbps (1.25-3.1 GB/second). Any node that coordinates a disproportionate share of the writes (a leader, a hot ingest endpoint, a popular partition) must push its fan-out through that single link; at 25 Gbps, one coordinator saturates at roughly 2,500-3,000 writes/second
Result: unless ingest is spread almost perfectly evenly (it rarely is), throughput collapses to a few thousand writes/second, far below the 50k target
The system collapses under its own replication traffic. You've added 1,000 nodes, and your usable write capacity is a small fraction of what a single unreplicated node could deliver.
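For a quick sanity check of this breakdown, here is a few-line Python sketch. The NIC figure and the "coordinator share" knob (how much of the total ingest funnels through one node) are assumptions for illustration.

```python
# Per-node fan-out bandwidth under full replication: 50k app writes/second, 1KB records.
NIC_GB_PER_SEC = 3.1   # ~25 Gbps, an assumed per-node network budget

def coordinator_outbound_gb_per_sec(nodes: int, app_writes: int, coordinator_share: float) -> float:
    writes_handled = app_writes * coordinator_share      # writes this node must fan out
    return writes_handled * (nodes - 1) * 1e3 / 1e9      # 1KB sent to every other node

for nodes in (10, 100, 1000):
    even = coordinator_outbound_gb_per_sec(nodes, 50_000, 1 / nodes)  # perfectly even ingest
    hot = coordinator_outbound_gb_per_sec(nodes, 50_000, 1.0)         # everything through one node
    print(f"{nodes:>4} nodes: even ingest {even:.2f} GB/s, single ingest {hot:.1f} GB/s, NIC {NIC_GB_PER_SEC} GB/s")
```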
Real-World Mitigation Strategies
Systems that successfully operate at scale use combinations of techniques to manage write amplification:
Technique 1: Intelligent Sharding
Instead of replicating everything everywhere, partition data and replicate partitions selectively.
HarperDB approach: Sub-databases with configurable replication. Critical data (user accounts) might replicate everywhere. Transactional data (orders) shards by geography. Analytics data (logs) replicates to centralized warehouse only[13].
Result: Write amplification averages 3-5× instead of 100×.
Technique 2: Lazy Replication
Replicate immediately to a small set of synchronous replicas (durability), then lazily replicate to additional nodes (availability).
Cassandra approach: Consistency level can be LOCAL_QUORUM (fast) for writes, but data still eventually replicates to all nodes in all datacenters[3].
Result: Write latency stays low (10-50ms) but you pay bandwidth cost eventually.
Technique 3: Hierarchical Replication
Organize nodes in a hierarchy. Writes propagate through leaders at each level.
Example topology (3 datacenters, 10 nodes each):
Global Leader
├── DC1 Leader → 10 nodes
├── DC2 Leader → 10 nodes
└── DC3 Leader → 10 nodes
Write goes to DC1 leader → propagates to DC1 nodes → eventually to other DC leaders → propagates to their nodes.
Result: Cross-DC transfers per write drop from one per remote node (20 in this topology) to one per remote datacenter (2).
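The transfer-count arithmetic for that topology fits in a few lines of Python; the datacenter and node counts are the assumptions from the diagram, nothing more.

```python
# Cross-datacenter transfers per write: naive mesh (the origin sends one copy straight
# to every remote node) versus the hierarchy above (one copy per remote DC leader,
# which then fans out locally).
def mesh_cross_dc_transfers(dcs: int, nodes_per_dc: int) -> int:
    return (dcs - 1) * nodes_per_dc      # one copy per remote node

def hierarchical_cross_dc_transfers(dcs: int, nodes_per_dc: int) -> int:
    return dcs - 1                        # one copy per remote DC leader

print(mesh_cross_dc_transfers(3, 10), "vs", hierarchical_cross_dc_transfers(3, 10))  # 20 vs 2
```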
Technique 4: Delta Replication
Instead of replicating entire records, replicate only the changed fields.
MongoDB approach: oplog entries describe the change to apply (the modified fields), not the full document[4].
Result: Write amplification measured in bytes, not kilobytes. A single-field update might replicate as roughly 100 bytes instead of a 1KB record.
Technique 5: Time-Based Eviction
Keep hot data replicated, evict cold data to single-copy storage.
Pattern:
Recent 7 days: hot, replicated 3× across regions
8-30 days: single copy in one region (1×)
30+ days: a single copy offloaded to cheap cold object storage
Result: Write amplification drops over time as data cools.
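As a sketch, the policy can be expressed as a replication factor that decays with record age. The thresholds mirror the example above, and the uniform age distribution is an assumption purely for illustration.

```python
# Replication factor by record age, following the tiering pattern above.
def replication_factor(age_days: int) -> float:
    if age_days <= 7:
        return 3.0      # hot: three copies in the serving tier
    if age_days <= 30:
        return 1.0      # warm: single copy, single region
    return 1.0          # cold: single copy in cheap object storage

# Average number of copies maintained over a 90-day retention window,
# assuming records are spread uniformly across ages.
avg = sum(replication_factor(d) for d in range(1, 91)) / 90
print(f"average replication factor over 90 days: {avg:.2f}x")   # ~1.16x
```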
The Write Amplification Tax in Practice
Let’s model a real-world e-commerce application:
Workload:
1 million active users
100,000 writes/second at peak across the checkout flow
1KB average record size
99.99% availability requirement
Option 1: Full Replication (10 nodes)
Application writes: 100,000/second peak
Physical writes: 100,000/second on every node, 1,000,000/second cluster-wide (10× amplification)
Problem: Every SSD in the fleet is pinned at 100% of its write budget, with no headroom for reads, compaction, or recovery, and adding nodes adds wear and bandwidth without adding a single write per second of capacity. Doesn't work.
Option 2: Selective Replication
User accounts: 3× replication (critical)
Product catalog: 5× replication (read-heavy)
Shopping carts: 3× replication (transient)
Orders: 1× initially, replicate to warehouse async
Logs: 1×, stream to analytics
Effective write amplification: ~2.5×
Physical writes: 250,000/second peak
With 10 nodes (data sharded across them): 25,000 writes/second/node
Result: Comfortably within hardware capacity, with a much better cost structure
Cost comparison:
Full replication: roughly an order of magnitude more drive wear, hardware refresh, and replication bandwidth for the same workload, on the order of $200k/month
Selective replication: 10 modest nodes ≈ $20k/month
That's roughly a 10× cost difference for the same workload, just by being selective about what replicates where.
When Perfect Locality Is Worth the Cost
Despite everything we’ve covered, there are scenarios where full replication makes sense:
Small, critical datasets: If your entire dataset is 10GB and changes infrequently, replicate it everywhere. The cost is negligible and availability is perfect.
Read-heavy workloads: If you have 1M reads/second and 100 writes/second, the write amplification cost is dwarfed by the read performance benefit.
Compliance requirements: Some regulations require data to be available locally for audit purposes. Write amplification is the cost of compliance.
Disaster recovery: Keeping a full replica in a geographically distant location for DR purposes is expensive but necessary for some businesses.
The key is being intentional. Don’t replicate everything because it’s easier than thinking about data placement. Replicate because you’ve done the math and decided the cost is worth the benefit.
The Path Forward
We’ve established that write amplification is the fundamental constraint on “data everywhere” architectures. You can mitigate it with clever replication strategies, hierarchical topologies, and selective placement—but you cannot eliminate it.
This leads to an uncomfortable conclusion: perfect locality is impossible at scale for write-heavy workloads.
If you can’t put everything everywhere, you must make choices: which data lives where? Based on what criteria? And how do you make those decisions systematically instead of through manual configuration?
In the next chapter, we’ll examine sharding and partitioning—the classic approach to avoiding write amplification by splitting data across nodes. We’ll see how geographic partitioning can give you locality for many queries while avoiding full replication costs.
But we’ll also see the new problems this creates: cross-shard queries, rebalancing overhead, and the challenge of choosing the right partition key when access patterns change over time.
Because here’s the thing about distributed systems: you never solve problems, you just trade them for different problems. The art is picking problems you can live with.
References
[1] Samsung, “PM9A3 NVMe SSD Specifications,” Product Datasheet, 2023. [Online]. Available: https://semiconductor.samsung.com/ssd/datacenter-ssd/pm9a3/
[2] J. Schindler et al., “Understanding SSD Endurance and Write Amplification in Enterprise Storage,” ACM Transactions on Storage, vol. 11, no. 4, pp. 1-27, 2015.
[3] A. Lakshman and P. Malik, “Cassandra: A Decentralized Structured Storage System,” ACM SIGOPS Operating Systems Review, vol. 44, no. 2, pp. 35-40, 2010.
[4] K. Chodorow, “MongoDB: The Definitive Guide,” O’Reilly Media, 3rd ed., 2019.
[5] R. Taft et al., “CockroachDB: The Resilient Geo-Distributed SQL Database,” Proc. 2020 ACM SIGMOD International Conference on Management of Data, pp. 1493-1509, 2020.
[6] etcd, “etcd Documentation: Understanding Failure,” 2024. [Online]. Available: https://etcd.io/docs/
[7] PostgreSQL, “High Availability, Load Balancing, and Replication,” PostgreSQL Documentation, 2024. [Online]. Available: https://www.postgresql.org/docs/
[8] MySQL, “Replication,” MySQL Documentation, 2024. [Online]. Available: https://dev.mysql.com/doc/refman/8.0/en/replication.html
[9] J. Kreps et al., “Kafka: A Distributed Messaging System for Log Processing,” Proc. 6th International Workshop on Networking Meets Databases, 2011.
[10] G. DeCandia et al., “Dynamo: Amazon’s Highly Available Key-value Store,” Proc. 21st ACM Symposium on Operating Systems Principles, pp. 205-220, 2007.
[11] M. Shapiro et al., “Conflict-free Replicated Data Types,” Proc. 13th International Symposium on Stabilization, Safety, and Security of Distributed Systems, pp. 386-400, 2011.
[12] M. Kleppmann and A. R. Beresford, “A Conflict-Free Replicated JSON Datatype,” IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 10, pp. 2733-2746, 2017.
[13] HarperDB, “Sub-databases and Component Architecture,” Technical Documentation, 2024. [Online]. Available: https://docs.harperdb.io/
Next in this series: Chapter 6 - Sharding, Partitioning, and Data Residency, where we’ll explore how intelligently splitting data across nodes can reduce write amplification—and discover the new complexity this introduces.

