Chapter 6 – Sharding, Partitioning, and Data Residency
How to Split Data Without Breaking Your Application
In Chapter 5, we established that write amplification makes full replication untenable at scale. You can’t put all data on all nodes—the storage, bandwidth, and operational costs become prohibitive.
The obvious solution: don’t replicate everything. Instead, split your data across nodes. Each piece of data lives on a subset of nodes, not all of them. Writes only replicate within that subset. Write amplification becomes 3× instead of 100×.
This is sharding—horizontal partitioning of data across multiple nodes. It’s one of the oldest techniques in distributed systems, dating back to the early days of distributed databases[1]. It’s also one of the most problematic. Because when you split your data, you split your performance characteristics, your operational complexity, and your failure modes right along with it.
Let’s explore how sharding works, where it succeeds, and where it creates new problems that are sometimes worse than the ones it solves.
The Basic Concept: Horizontal Partitioning
Imagine a users table with 100 million records. You have 10 database nodes. Instead of replicating all 100M records to all 10 nodes (1 billion total records), you split the table:
Node 1: users 0-9,999,999 (10M records)
Node 2: users 10,000,000-19,999,999 (10M records)
Node 3: users 20,000,000-29,999,999 (10M records)
...
Node 10: users 90,000,000-99,999,999 (10M records)
Each node holds 10% of the data. Queries for a specific user go to exactly one node. Write amplification is 1× at the shard level, or typically 3× once you include replication within that shard's replica set.
This is horizontal partitioning or sharding. The terms are often used interchangeably, though “sharding” typically implies distributed nodes while “partitioning” might refer to logical separation within a single database.
The benefits are immediate:
Storage: Each node only needs 10% of the capacity
Write throughput: 10 nodes × 100k writes/sec = 1M writes/sec total capacity
Cost: Linear scaling—10× data = 10× nodes, not 10× amplification
But there’s a catch: how do you know which node has which user?
Partition Key Selection: The Most Important Decision
The partition key (or shard key) determines which node owns which data. Choose poorly and your sharding strategy collapses. Choose well and you can scale linearly to thousands of nodes.
Strategy 1: Range-Based Partitioning
Split data by key ranges. Users 0-9,999,999 on Node 1, 10M-19,999,999 on Node 2, etc.
Partition Map:
Node 1: [0, 10000000)
Node 2: [10000000, 20000000)
Node 3: [20000000, 30000000)
...
Pros:
Simple to implement
Range queries are efficient ("give me users 5M-6M" hits one node)
Easy to understand and debug
Cons:
Sequential IDs create hot spots (all new users hit the highest node)
Manual rebalancing when ranges fill unevenly
Doesn’t account for access patterns (some ranges might be queried 100× more)
Example failure mode: You’re a social network. User IDs are sequential. A celebrity with ID 95,000,000 gains 10M followers. All queries for that user hit Node 10. Node 10 now handles 80% of traffic while Nodes 1-9 sit idle. Your “distributed” system funnels most of its load through a single bottleneck.
MongoDB uses range-based sharding by default, with automatic splitting when chunks grow too large[2]. But you still need to choose a shard key that distributes load evenly.
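Below is a minimal Python sketch of range-based routing, assuming the router knows the range boundaries shown above. The boundary values and node names are illustrative, not taken from any particular system.
import bisect
# Upper bounds (exclusive) of each node's range, in ascending order.
RANGE_UPPER_BOUNDS = [10_000_000, 20_000_000, 30_000_000, 40_000_000,
                      50_000_000, 60_000_000, 70_000_000, 80_000_000,
                      90_000_000, 100_000_000]
NODES = [f"node-{i}" for i in range(1, 11)]
def route_by_range(user_id: int) -> str:
    """Return the node whose range contains user_id."""
    idx = bisect.bisect_right(RANGE_UPPER_BOUNDS, user_id)
    if idx >= len(NODES):
        raise ValueError(f"user_id {user_id} is outside all known ranges")
    return NODES[idx]
print(route_by_range(5_500_000))   # node-1: range [0, 10,000,000)
print(route_by_range(95_000_000))  # node-10: range [90,000,000, 100,000,000)
A lookup is a single binary search over the boundary list, which is why range routing is so cheap and so easy to debug.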
Strategy 2: Hash-Based Partitioning
Hash the partition key and use the hash to determine the node.
node_id = hash(user_id) % num_nodes
Example:
hash(12345) = 789456123
789456123 % 10 = 3
→ User 12345 lives on Node 3
Pros:
Even distribution regardless of key values
No hot spots from sequential keys
Works well when you don’t need range queries
Cons:
Range queries hit all nodes ("give me users 5M-6M" requires querying all 10 nodes)
Rebalancing when adding nodes is expensive (must rehash all data)
No geographic affinity (users in Germany and Japan might hash to the same node)
Example failure mode: You’re an e-commerce platform. You want to query “all orders from the past week” for reporting. With hash sharding, this hits all 100 shards. You’ve just turned a simple query into a distributed fan-out across your entire cluster.
Cassandra uses hash-based partitioning with consistent hashing to minimize rebalancing[3].
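Here is a minimal Python sketch of the modulo routing shown above. It uses MD5 rather than Python's built-in hash(), which is randomized per process for strings; the resulting values will differ from the illustrative numbers earlier in this section, but the routing pattern is the same.
import hashlib
NUM_NODES = 10
def route_by_hash(user_id: int, num_nodes: int = NUM_NODES) -> int:
    # Any stable hash works; the key point is that the same user_id
    # always maps to the same node index.
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % num_nodes
print(route_by_hash(12345))  # deterministic node index in [0, 10)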
Strategy 3: Geography-Based Partitioning
Partition by geographic region or datacenter.
Partition Map:
Node 1-3: US-East users
Node 4-6: EU users
Node 7-9: APAC users
Pros:
Latency is inherently local (EU users query EU nodes)
Meets data residency requirements naturally
Reduces cross-region traffic dramatically
Cons:
Uneven distribution if regions have different user counts
Cross-region queries are expensive
Compliance complexity (what if user moves from EU to US?)
Example failure mode: You launch in Europe first. 90% of users are European. Your US and APAC nodes sit idle while EU nodes are overloaded. You can’t rebalance without violating residency laws.
This is HarperDB’s primary sharding strategy with its sub-database architecture—partition by application component and geography, with explicit control over where data lives[4].
Strategy 4: Composite Partitioning
Combine multiple strategies. Hash within region, or range within hash buckets.
Example: Geography + Hash
1. Determine region from user location → EU
2. Hash user_id within EU shards → Node EU-3
Partition Map:
US shards: 1-10 (hash-based within US)
EU shards: 11-20 (hash-based within EU)
APAC shards: 21-30 (hash-based within APAC)
Pros:
Combines benefits of both approaches
Can optimize for both locality and distribution
Flexible for different workload characteristics
Cons:
Complexity in routing logic
Harder to reason about performance
More edge cases in rebalancing
This is the approach taken by large-scale systems like Facebook’s TAO and Google’s F1[5][6]—geography for locality, hash for distribution.
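A minimal Python sketch of this two-level routing might look like the following; the region names and per-region shard counts are assumptions for illustration, not details of TAO or F1.
import hashlib
REGION_SHARDS = {
    "US":   [f"us-{i}" for i in range(1, 11)],    # shards 1-10
    "EU":   [f"eu-{i}" for i in range(1, 11)],    # shards 11-20 in the map above
    "APAC": [f"apac-{i}" for i in range(1, 11)],  # shards 21-30 in the map above
}
def route_composite(user_id: int, region: str) -> str:
    shards = REGION_SHARDS[region]                 # step 1: pick the region's shard pool
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return shards[int(digest, 16) % len(shards)]   # step 2: hash within that pool
print(route_composite(12345, "EU"))  # some eu-N shard, always the same for this user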
Consistent Hashing: The Rebalancing Solution
Classic hash partitioning has a fatal flaw: when you add or remove nodes, you must rehash everything.
Start with 10 nodes: hash(key) % 10
Add an 11th node: hash(key) % 11
Now the mapping has changed for nearly every key. User 12345 was on Node 3, now they’re on Node 5. You must migrate ~90% of your data.
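A quick sanity check of that claim: count how many uniformly distributed keys change owners when the modulus goes from 10 to 11.
# Uniform integer keys stand in for hashed keys here.
moved = sum(1 for key in range(1_000_000) if key % 10 != key % 11)
print(f"{moved / 1_000_000:.1%} of keys change nodes")  # ~90.9%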
Consistent hashing solves this[7]. Instead of hashing keys directly to nodes, hash them to points on a ring:
Ring: 0 to 2^32-1
Hash positions:
Node 1: 100, 500, 900
Node 2: 200, 600, 1000
Node 3: 300, 700, 1100
...
For a key:
hash(key) = 350
→ Walk clockwise to next node
→ Node 1 (at position 500)
Adding an 11th node only affects the keys in the arcs immediately preceding the new node's positions: roughly 1/11 of the data (about 9%) must move, not ~90%.
Virtual nodes (vnodes) improve this further. Each physical node owns multiple positions on the ring:
Node 1: positions [100, 500, 900, 1300, 1700, ...] (256 vnodes)
Node 2: positions [200, 600, 1000, 1400, 1800, ...] (256 vnodes)
This ensures that when a node fails or is added, load redistributes across all remaining nodes, not just its immediate neighbors.
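Below is a minimal Python sketch of a consistent-hash ring with virtual nodes, using MD5 for ring positions and a sorted list for the clockwise walk. The positions won't match the small illustrative numbers above, but the structure, and the roughly 1/11 movement when an 11th node joins, are what matter.
import bisect
import hashlib
def _position(label: str) -> int:
    return int(hashlib.md5(label.encode()).hexdigest(), 16)
class HashRing:
    def __init__(self, nodes, vnodes_per_node=256):
        self._ring = []                        # list of (position, node) pairs
        for node in nodes:
            for v in range(vnodes_per_node):
                self._ring.append((_position(f"{node}#{v}"), node))
        self._ring.sort()
        self._positions = [pos for pos, _ in self._ring]
    def node_for(self, key: str) -> str:
        pos = _position(key)
        # walk clockwise to the next vnode, wrapping around at the end of the ring
        idx = bisect.bisect_right(self._positions, pos) % len(self._ring)
        return self._ring[idx][1]
ring = HashRing([f"node-{i}" for i in range(1, 11)])
before = {k: ring.node_for(k) for k in (f"user:{i}" for i in range(10_000))}
ring11 = HashRing([f"node-{i}" for i in range(1, 12)])   # add an 11th node
moved = sum(1 for k, n in before.items() if ring11.node_for(k) != n)
print(f"{moved / len(before):.1%} of keys moved")        # roughly 1/11, about 9%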
Dynamo-style stores and Cassandra both use consistent hashing with vnodes[3][8]; Cassandra, for example, historically defaulted to 256 vnodes per node.
Performance impact:
Adding a node: 1/N of data moves (where N is number of nodes)
Removing a node: 1/N of data moves
With 100 nodes, only ~1% of data moves per topology change
That is roughly two orders of magnitude less data movement than naive modulo hashing, which would reshuffle nearly every key.
The Hot Spot Problem
No matter how carefully you partition, real-world access patterns create hot spots—nodes that receive disproportionate traffic.
Hot Spot Cause 1: Celebrity Users
Twitter, 2012: Justin Bieber tweets. 10 million followers see it immediately. All queries for @justinbieber’s timeline hit a single shard. That shard is now handling 100× more traffic than average[9].
Detection: Monitor per-shard query rates. Alert when one shard exceeds 3× median.
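A minimal Python sketch of that detection rule, assuming per-shard query-rate counters are already being collected somewhere:
from statistics import median
def hot_shards(qps_by_shard: dict, factor: float = 3.0) -> list:
    """Return shards whose query rate exceeds factor × the median rate."""
    threshold = factor * median(qps_by_shard.values())
    return [shard for shard, qps in qps_by_shard.items() if qps > threshold]
print(hot_shards({"shard-1": 900, "shard-2": 1100, "shard-3": 950, "shard-4": 4200}))
# ['shard-4']  (median is 1025, so the threshold is 3075)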
Mitigation:
Replicate hot keys: Copy celebrity timelines to multiple shards, load balance reads
Cache aggressively: Hot data should be in application-level cache, not hitting database
Rate limit: Implement per-key rate limiting to prevent one key from monopolizing resources
Hot Spot Cause 2: Time-Based Data
E-commerce site: 90% of queries are for “recent orders” (past 7 days). If you partition by date, the most recent partition is permanently hot.
Mitigation:
Composite key: Partition by (date_bucket, hash(order_id)) (see the sketch after this list)
Write to multiple partitions: Recent data writes to dedicated “hot” cluster, ages to cold storage
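The composite-key mitigation can be sketched in a few lines of Python; the daily bucket granularity and the 16 hash buckets are assumptions for illustration.
import hashlib
from datetime import date
HASH_BUCKETS = 16
def composite_partition_key(order_id: str, order_date: date) -> tuple:
    digest = hashlib.md5(order_id.encode()).hexdigest()
    return (order_date.isoformat(), int(digest, 16) % HASH_BUCKETS)
print(composite_partition_key("order-98765", date(2024, 11, 29)))
# ('2024-11-29', <some bucket in 0-15>): today's writes spread over 16 partitions instead of one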
Hot Spot Cause 3: Geographic Events
Olympics in Tokyo. Japanese users spike 10×. Your APAC shards are overwhelmed while US/EU shards idle.
Mitigation:
Temporary replication: Automatically replicate hot Japanese data to nearby regions
Elastic scaling: Add APAC capacity temporarily, remove after event
Read replicas: Spin up read-only replicas in adjacent regions
Rebalancing: The Operational Nightmare
Your sharding is working well. Then growth happens. Node 3 fills up. Or you add more capacity. Or you realize your partition key was suboptimal. Now you need to rebalance—move data from one shard to another.
This is dangerous.
Rebalancing Challenge 1: Availability During Migration
You’re moving 1TB from Node 3 to Node 4. This takes hours. During migration:
Which node answers queries for migrating data?
What happens to writes during migration?
What if the migration fails halfway through?
MongoDB’s approach[2]:
Start background migration (chunk mover)
Node 3 continues serving reads/writes
Node 4 copies data in batches
When ~90% copied, enter brief write-lock phase
Copy final deltas, update routing table
Node 4 now serves traffic
Downtime: ~100-500ms during final switchover. But if migration fails, you must retry—possibly multiple times.
Rebalancing Challenge 2: Cross-Shard Queries During Migration
Your application queries “all users in Europe.” Half of European users are migrating from Node 3 to Node 4. The query must hit both nodes and deduplicate results.
Performance impact: During rebalancing, cross-shard queries that touch migrating ranges are roughly 2× slower (they must query both the old and new nodes) and roughly 2× more expensive (higher resource usage).
Rebalancing Challenge 3: Write Amplification During Migration
Every write to migrating data must go to both old and new nodes to maintain consistency. Write amplification temporarily increases from 3× to 6×.
If you’re rebalancing 30% of your data, your cluster-wide write amplification increases by ~30%. At high throughput, this can saturate storage and cause cascading failures.
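The arithmetic is worth making explicit; a short Python check of the blended amplification:
# If 30% of the data is mid-migration and writes to that slice are doubled
# (3× replication to both the old and new owners), the blended cluster-wide
# write amplification rises by roughly 30%.
base, during_migration, migrating_fraction = 3, 6, 0.30
blended = (1 - migrating_fraction) * base + migrating_fraction * during_migration
print(f"{blended:.1f}x amplification, a {blended / base - 1:.0%} increase")  # 3.9x, 30%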
Real-world incident: A team I worked with tried to rebalance 40% of a 50-node HarperDB cluster during business hours. Write amplification spiked, storage queues filled, query latency went from 10ms to 2,000ms, and they had to abort the migration. Lesson learned: rebalance during low-traffic windows and limit concurrent migrations.
Compliance and Data Residency
Sharding isn’t just about performance—it’s increasingly about compliance. GDPR, CCPA, China’s cybersecurity law, Russia’s data localization law—dozens of regulations require that certain data stay in certain regions[10].
Residency Requirement 1: Data Must Stay In-Region
GDPR: Personal data of EU residents may only be transferred outside the EU/EEA to countries with an adequacy decision or under other approved safeguards, so the simplest compliant posture is to keep it in-region.
Implementation: Partition by user region. EU users → EU shards. Never replicate EU data outside EU.
Partition Map (Compliant):
EU-1, EU-2, EU-3: EU users only
US-1, US-2, US-3: US users only
APAC-1, APAC-2: APAC users only
Challenge: What if an EU user accesses the application from the US? The query must route to EU shards, adding 80-100ms of latency.
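A minimal Python sketch of residency-aware routing, where the user's home region, not the caller's location, decides which shard pool may hold their data (region names and shard lists are illustrative):
RESIDENCY_SHARDS = {
    "EU":   ["EU-1", "EU-2", "EU-3"],
    "US":   ["US-1", "US-2", "US-3"],
    "APAC": ["APAC-1", "APAC-2"],
}
def shards_for_user(home_region: str) -> list:
    """Only the user's home-region shards may store or serve their data."""
    return RESIDENCY_SHARDS[home_region]
# An EU user browsing from the US still routes to EU shards,
# trading ~80-100ms of latency for compliance.
print(shards_for_user("EU"))  # ['EU-1', 'EU-2', 'EU-3']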
Residency Requirement 2: Cross-Border Transfers Require Consent
CCPA: California residents’ data can leave California, but they must be notified and can opt out.
Implementation: Default partition to California shards. Allow replication elsewhere only with consent flag set.
Challenge: Tracking consent per-user, per-data-type, per-destination. Complex access control logic.
Residency Requirement 3: Auditable Access Logs
SOX, HIPAA, PCI-DSS: All access to sensitive data must be logged and auditable.
Implementation: Wrap all queries with audit logging. For sharded systems, this means distributed log aggregation—ensuring logs from all shards are collected and correlated.
Challenge: Log volume scales with number of shards. 100 shards × 10k queries/sec = 1M log entries/sec to process and store.
The Residency-Performance Tension
Here’s the fundamental tension: compliance wants data to stay put, performance wants data to move closer to users.
Example: You’re a SaaS company with EU and US customers. Compliance says EU data stays in EU. But your US operations team needs read access for customer support. Do you:
Replicate to US with encryption/tokenization: Meets performance needs, increases compliance risk
Force US team to query EU shards: Meets compliance, adds 80-100ms latency to every support query
Create read replicas in US with strict access controls: Middle ground, but complex to implement and audit
There’s no perfect answer. Systems like AWS Sovereign Cloud and Azure Confidential Computing attempt to solve this with hardware-level isolation and cryptographic attestation[11][12], but these add cost and complexity.
Adaptive Partitioning: The Self-Tuning Ideal
Static partitioning breaks when access patterns change. What if the system could automatically detect hot spots and rebalance?
DynamoDB’s Adaptive Capacity
DynamoDB monitors per-partition metrics (read/write throughput, storage). When a partition becomes hot, it automatically:
Allocates more capacity to that partition
Splits the partition if it’s too large
Rebalances traffic across partitions[8]
Example: Black Friday. Orders spike 10×. DynamoDB detects the hot partition, allocates more capacity, splits if needed. All automatic, no operator intervention.
Limitations: Only works within DynamoDB’s model. Requires AWS infrastructure. Can’t handle certain hot spot patterns (single extremely hot key).
HarperDB’s Composable Architecture
HarperDB allows explicit control over data placement at the component level. Each “sub-database” can have different replication and partitioning strategies[4].
Example:
User accounts: 3× replicated across all regions (critical, low-write)
Product catalog: 5× replicated (read-heavy)
Shopping carts: Partitioned by user geography, 3× local replication
Analytics logs: Single-copy, streamed to warehouse
This isn’t automatic adaptation, but it gives operators fine-grained control to optimize per-workload.
The Feedback Loop Model
The ideal adaptive system would do the following (a skeleton of the loop is sketched after this list):
Collect telemetry: Query frequency, data temperature, access geography
Detect patterns: “Orders from region X are hot, accounts from region Y are cold”
Predict optimal placement: “Move hot orders closer to X, consolidate cold accounts”
Execute migrations: Automatically rebalance with minimal disruption
Measure impact: Did latency improve? Did cost decrease?
Repeat: Continuous optimization
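A hypothetical skeleton of that loop in Python, with stub functions standing in for real telemetry, prediction, and migration machinery; everything here is an assumption for illustration.
def collect_telemetry():
    # stand-in for per-shard metrics: query rate, data temperature, request geography
    return {"orders-eu": {"qps": 4200, "region": "EU"}, "accounts-us": {"qps": 300, "region": "US"}}
def detect_patterns(telemetry):
    return {"hot": [s for s, m in telemetry.items() if m["qps"] > 3000],
            "cold": [s for s, m in telemetry.items() if m["qps"] < 500]}
def propose_placement(patterns):
    # move or split hot shards, consolidate cold ones
    return [("split", s) for s in patterns["hot"]] + [("consolidate", s) for s in patterns["cold"]]
def execute_and_measure(plan):
    for action, shard in plan:
        print(f"would {action} {shard}")  # a real system throttles, migrates, then re-measures
def placement_loop(iterations=1):
    for _ in range(iterations):           # in production this runs continuously
        plan = propose_placement(detect_patterns(collect_telemetry()))
        execute_and_measure(plan)
placement_loop()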
This is the “Intelligent Data Plane” concept we’ll explore in Part III—a control layer that treats data placement as a continuous optimization problem, not a one-time architectural decision.
The Partition Key Paradox
Here’s the paradox: to choose a good partition key, you need to understand your access patterns. But access patterns change over time. The partition key that’s optimal today might be terrible in six months.
Example: You’re building a social network. You partition by user_id (hash-based). Initially, queries are “get user profile” (single-shard). The system works great.
Six months later, your killer feature is “show me all posts from my friends” (multi-shard fan-out). Now every query hits 50+ shards. Performance collapses.
To fix this, you need to repartition by post_id or denormalize data—both expensive migrations. The partition key that optimized for phase 1 is wrong for phase 2.
Lesson: Partition keys are technical debt. Choose conservatively. Plan for migration from day one. Monitor access patterns and be ready to repartition.
Some systems try to avoid this trap by using composite keys or maintaining multiple indexes, but this just trades partition key problems for index management problems.
Cross-Shard Queries: The Unavoidable Tax
No matter how clever your partitioning, some queries span shards.
Scenario: “Show me total revenue for the past month”
If revenue data is partitioned by customer_id (for locality), this query must:
Fan out to all shards
Each shard computes its local sum
Coordinator aggregates results
Coordinator → Query all 100 shards
Shard 1: $45,231
Shard 2: $39,877
...
Shard 100: $52,103
Coordinator: Sum = $4,892,445
Latency impact: Query time = slowest shard + aggregation overhead. If 99 shards respond in 10ms but one shard is busy and takes 500ms, your query takes 500ms.
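A minimal Python sketch of this scatter-gather pattern, with a stand-in query_shard function in place of a real per-shard revenue query:
from concurrent.futures import ThreadPoolExecutor
import random
SHARDS = [f"shard-{i}" for i in range(1, 101)]
def query_shard(shard: str) -> float:
    # placeholder: pretend each shard returns its local revenue sum
    return round(random.uniform(30_000, 60_000), 2)
with ThreadPoolExecutor(max_workers=20) as pool:
    partial_sums = list(pool.map(query_shard, SHARDS))   # fan out to every shard
total_revenue = sum(partial_sums)                        # coordinator aggregates
print(f"total revenue: ${total_revenue:,.2f}")
# End-to-end latency is gated by the slowest shard, as noted above.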
Mitigation strategies:
Pre-aggregate: Maintain a separate aggregation table that’s updated incrementally
MapReduce: Run aggregations as background jobs, not real-time queries
Approximate: Use probabilistic data structures (HyperLogLog, Count-Min Sketch) for fast approximate answers[13]
Cache: If the query is common, cache the result and invalidate when underlying data changes
But there’s no magic solution. Cross-shard aggregations are fundamentally expensive. Design your partition key to minimize them.
The Principle: Placement Must Evolve
The key insight from this chapter: data placement is not a one-time decision.
Your initial partition strategy will be wrong. Not because you made a mistake, but because requirements change:
Data grows (yesterday’s single-node table is tomorrow’s sharded cluster)
Access patterns shift (your read-heavy workload becomes write-heavy)
Geography changes (you launch in new regions)
Regulations evolve (new compliance requirements emerge)
Technology improves (new database features enable better strategies)
Systems that treat sharding as a static architectural decision become brittle. Systems that plan for evolution—with monitoring, migration tools, and clear operational procedures—remain flexible.
In the next chapter, we’ll examine how consistency, availability, and latency interact in sharded systems. We’ll see how CAP theorem and PACELC framework apply to real-world partitioned architectures, and we’ll quantify the millisecond and cost implications of different consistency models.
Because once you’ve sharded your data, you’ve created a distributed system with all its attendant complexity. And distributed systems force you to choose: consistency, availability, or low latency. You can optimize for two, but never all three simultaneously.
References
[1] D. J. DeWitt et al., “The Gamma Database Machine Project,” IEEE Transactions on Knowledge and Data Engineering, vol. 2, no. 1, pp. 44-62, 1990.
[2] MongoDB, “Sharding,” MongoDB Manual, 2024. [Online]. Available: https://docs.mongodb.com/manual/sharding/
[3] A. Lakshman and P. Malik, “Cassandra: A Decentralized Structured Storage System,” ACM SIGOPS Operating Systems Review, vol. 44, no. 2, pp. 35-40, 2010.
[4] HarperDB, “Sub-databases and Component Architecture,” Technical Documentation, 2024. [Online]. Available: https://docs.harperdb.io/
[5] N. Bronson et al., “TAO: Facebook’s Distributed Data Store for the Social Graph,” Proc. 2013 USENIX Annual Technical Conference, pp. 49-60, 2013.
[6] J. Shute et al., “F1: A Distributed SQL Database That Scales,” Proc. VLDB Endowment, vol. 6, no. 11, pp. 1068-1079, 2013.
[7] D. Karger et al., “Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web,” Proc. 29th Annual ACM Symposium on Theory of Computing, pp. 654-663, 1997.
[8] G. DeCandia et al., “Dynamo: Amazon’s Highly Available Key-value Store,” Proc. 21st ACM Symposium on Operating Systems Principles, pp. 205-220, 2007.
[9] Twitter Engineering, “Handling Scale: Building Twitter,” Twitter Engineering Blog, 2013. [Online]. Available: https://blog.twitter.com/engineering/
[10] European Parliament, “General Data Protection Regulation (GDPR),” Official Journal of the European Union, 2016.
[11] AWS, “AWS Sovereign Cloud,” AWS Documentation, 2024. [Online]. Available: https://aws.amazon.com/sovereign-cloud/
[12] Microsoft, “Azure Confidential Computing,” Microsoft Azure Documentation, 2024. [Online]. Available: https://azure.microsoft.com/en-us/solutions/confidential-compute/
[13] P. Flajolet et al., “HyperLogLog: The Analysis of a Near-optimal Cardinality Estimation Algorithm,” Discrete Mathematics and Theoretical Computer Science, pp. 137-156, 2007.
Next in this series: Chapter 7 - Consistency, Availability, and Latency in Practice, where we’ll move beyond CAP theorem abstractions and quantify the real-world trade-offs of different consistency models in sharded systems.

