Chapter 10 – Data Gravity and Motion
The Dynamic Relationship Between Compute and Storage
In Chapter 9, we explored adaptive storage—systems that move data between tiers based on observed access patterns. But there’s a deeper question lurking beneath: why move the data at all? Why not move the compute instead?
This isn’t a new idea. The principle “move compute to data, not data to compute” has been a mantra in distributed systems for decades[1]. It’s the foundation of MapReduce, Hadoop, and Spark. The reasoning is simple: moving a few kilobytes of code is cheaper than moving terabytes of data.
But here’s what’s changed: in modern cloud infrastructure, both data and compute are fluid. Containers spin up in seconds. Serverless functions deploy globally in minutes. Object storage replicates across regions automatically. The question is no longer “should we move data or compute?” but rather “which one should move, when, and by how much?”
This chapter introduces data gravity—the concept that data and compute exert mutual attraction. Heavy data pulls compute toward it. Heavy compute pulls data toward it. The optimal architecture isn’t static placement of both, but dynamic equilibrium where each moves in response to the other.
We’ll model this mathematically, simulate it, and discover that static placement wastes 30-60% of potential efficiency.
The Traditional View: Data Has Gravity, Compute Moves
The original concept of data gravity comes from Dave McCrory (2010): “As data accumulates, it becomes harder to move. Applications and services are naturally attracted to large datasets”[2].
The physics analogy: Data is like a planet. The more data you have, the stronger its gravitational pull. Applications orbit around data.
Real-world example: Enterprise data warehouse with 500TB of customer data. Where do you run your analytics? You run them where the data lives. Moving 500TB to your analytics cluster would take days and cost thousands in bandwidth. Moving your analytics code (megabytes) to the data takes seconds.
This view led to architectures like:
Hadoop: Store data on HDFS, run MapReduce jobs where data lives
Snowflake: Centralized data warehouse, compute elastically scales at the data location
Databricks: Data lake with compute clusters co-located with storage
The implicit assumption: Data is heavy and immovable. Compute is light and mobile. Always move compute to data.
The Problem: This Only Works When Data Has One Center of Gravity
The traditional model assumes data has a single location—one massive data warehouse, one Hadoop cluster, one data lake. But modern applications don’t work that way.
Scenario 1: Multi-Region Application
You run a global SaaS application. You have:
1M users in North America
500k users in Europe
300k users in Asia-Pacific
Where should the data live? There’s no single “center of gravity.” Users are distributed.
Traditional approaches:
Put data in one region: NA users get 5ms queries, EU users get 100ms, APAC users get 150ms. Bad experience for 800k users.
Replicate everywhere: All users get 5ms queries but you pay 3× storage and write amplification costs.
Partition by region: NA users’ data in NA, EU in EU, APAC in APAC. Works until you need cross-region queries.
The problem: Data has multiple centers of gravity, not one.
Scenario 2: Temporal Patterns
Your application has daily cycles. During US business hours, 80% of queries come from US regions. During APAC business hours, 80% come from APAC.
Static placement: Choose one region for data. Half the day, most queries are cross-region and slow.
The problem: Center of gravity moves over time.
Scenario 3: Compute-Intensive Workloads
You’re running ML inference on images. Each image is 5MB. Processing requires 10 seconds of GPU time. You have 1M images to process.
Traditional logic: Images are heavy (5TB total), code is light (megabytes). Move compute to data.
But: GPUs are scarce and expensive. You have a GPU cluster in us-west-2, but images are distributed across all regions.
Do you:
Move 5TB of images to us-west-2? (Bandwidth: $400, time: hours)
Move inference code to each region? (No GPUs available in those regions)
The problem: Compute has its own gravity—availability, cost, and specialization.
The Synthesis: Bidirectional Gravity
What if we model gravity as bidirectional? Data attracts compute. Compute attracts data. The system should optimize for the total cost of moving both.
Data gravity factors:
Size: Larger data is harder to move (bandwidth cost, time)
Update frequency: Frequently updated data is harder to keep synchronized
Regulatory constraints: Some data cannot move (GDPR, residency laws)
Compute gravity factors:
Resource requirements: GPUs, specialized hardware limited to certain locations
Cost: Compute costs vary by region (us-west-2 often cheapest)
Scalability: Can you deploy compute anywhere, or only in specific regions?
The optimization problem: Minimize total latency and cost by optimally placing both data and compute.
Mathematical Model: Vector Fields of Attraction
Let’s formalize this with a simplified mathematical model.
Data gravity at location L:
G_data(L) = Σ (data_size_i × access_frequency_i × (1 / distance_to_L))
Where:
data_size_i: Size of data object i (GB)
access_frequency_i: Queries per hour to object i
distance_to_L: Geographic distance to location L (km)
This creates a gravity field. Locations with lots of data being accessed heavily have high gravity.
Compute demand at location L:
D_compute(L) = Σ (query_frequency_from_L × compute_required_per_query)
Where:
query_frequency_from_L: Queries originating from location L per hour
compute_required_per_query: CPU/memory/GPU required per query
This creates a demand field. Locations generating lots of queries have high compute demand.
Net force on data object i:
F_data(i, L) = D_compute(L) × (1 / data_size_i) × (1 / distance_to_L)
Data is pulled toward high-demand locations, inversely proportional to its size (heavy data moves less).
Net force on compute workload w:
F_compute(w, L) = G_data(L) × compute_efficiency(L) × cost_factor(L)
Compute is pulled toward high-gravity locations, weighted by efficiency and cost.
Equilibrium: The system reaches equilibrium when net forces are balanced. In practice, this means:
Heavy, rarely-accessed data stays put (high inertia)
Light, frequently-accessed data replicates to demand locations (low inertia)
Compute deploys near heavy data when data can’t move
Compute deploys in optimal cost regions when data can move
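To make the model concrete, here is a minimal Python sketch of these formulas. The regions, distances, and query volumes are hypothetical, and the helper names (data_gravity, force_on_data, and so on) are invented for illustration rather than drawn from any library.

from dataclasses import dataclass

# Hypothetical pairwise distances (km) between three illustrative regions.
# Same-region distance is set to 1 km to avoid dividing by zero.
DISTANCE_KM = {
    ("us", "us"): 1, ("us", "eu"): 7500, ("us", "apac"): 9000,
    ("eu", "us"): 7500, ("eu", "eu"): 1, ("eu", "apac"): 9500,
    ("apac", "us"): 9000, ("apac", "eu"): 9500, ("apac", "apac"): 1,
}

@dataclass
class DataObject:
    name: str
    location: str            # region where the primary copy currently lives
    size_gb: float
    queries_per_hour: float  # access_frequency_i in the formulas above

def data_gravity(objects, location):
    # G_data(L) = sum(size_i * access_frequency_i / distance_to_L)
    return sum(o.size_gb * o.queries_per_hour / DISTANCE_KM[(o.location, location)]
               for o in objects)

def compute_demand(queries_by_region, compute_per_query, location):
    # D_compute(L) = queries originating at L * compute required per query
    return queries_by_region.get(location, 0.0) * compute_per_query

def force_on_data(obj, queries_by_region, compute_per_query, location):
    # F_data(i, L) = D_compute(L) / (data_size_i * distance_to_L)
    demand = compute_demand(queries_by_region, compute_per_query, location)
    return demand / (obj.size_gb * DISTANCE_KM[(obj.location, location)])

def force_on_compute(objects, location, efficiency=1.0, cost_factor=1.0):
    # F_compute(w, L) = G_data(L) * compute_efficiency(L) * cost_factor(L)
    return data_gravity(objects, location) * efficiency * cost_factor

# Toy scenario: a 100GB dataset homed in the US, but most queries come from the EU.
objects = [DataObject("customer-db", "us", size_gb=100, queries_per_hour=13_000)]
queries = {"us": 2_000, "eu": 10_000, "apac": 1_000}   # sums to the object's 13k queries/hour

for region in ("us", "eu", "apac"):
    g = data_gravity(objects, region)
    pull = force_on_data(objects[0], queries, 1.0, region)
    print(f"{region:5s} G_data={g:10.2f}  pull on customer-db={pull:.6f}")

Running it shows the pull toward the EU dwarfing the pull toward APAC, simply because EU query volume is ten times higher at a comparable distance.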
Simulation: Shifting Centers of Gravity
Let’s simulate a realistic scenario to see how gravity shifts.
Setup:
Global application with 3 regions: US, EU, APAC
100GB dataset initially in US region
Query patterns change throughout the day (time zones)
Hour 0-8 (US Business Hours):
Query sources:
US: 10,000 queries/hour
EU: 1,000 queries/hour
APAC: 500 queries/hour
Data gravity: Centered in US (data lives there)
Compute demand: Highest in US
Optimal placement:
Data: US
Compute: US
Average query latency: 8ms (90% local, 10% cross-region)
Hour 8-16 (EU Business Hours):
Query sources:
US: 2,000 queries/hour
EU: 12,000 queries/hour
APAC: 1,000 queries/hour
Data gravity: Still in US (data hasn’t moved)
Compute demand: Highest in EU
Sub-optimal placement:
Data: US
Compute: US (following data)
Average query latency: 85ms (80% cross-region US-EU)
Better placement:
Data: Replicate hot subset (20GB) to EU
Compute: Move to EU
Average query latency: 12ms (75% local EU, 20% local US, 5% cross-region)
Hour 16-24 (APAC Business Hours):
Query sources:
US: 500 queries/hour
EU: 1,000 queries/hour
APAC: 8,000 queries/hour
Optimal placement:
Data: Replicate hot subset (15GB) to APAC
Compute: Move to APAC
Average query latency: 15ms
Static vs Dynamic comparison (24-hour average):
Static placement (data and compute always in US):
Average latency: 52ms
Storage cost: $100/month (100GB in US)
Compute cost: $500/month (running in US)
Bandwidth cost: $50/month (cross-region queries)
Total cost: $650/month, Average latency: 52ms
Dynamic placement (data and compute follow gravity):
Average latency: 12ms (4.3× faster)
Storage cost: $140/month (100GB primary + replicas)
Compute cost: $480/month (efficiency gains from better placement)
Bandwidth cost: $30/month (less cross-region traffic)
Migration cost: $20/month (moving compute, replicating data)
Total cost: $670/month (+3%), Average latency: 12ms (-77%)
The gravity insight: Spending an extra 3% on infrastructure to follow gravity reduces latency by 77%.
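The 24-hour roll-up can be sanity-checked with a few lines of Python. The per-window query volumes, dynamic latencies, and cost line items below are the figures quoted above (the static per-window latencies are not all stated, so only the dynamic average is recomputed); this is a back-of-the-envelope check, not a simulator.

# Dynamic per-window figures quoted above: (hours, queries/hour, average latency ms).
dynamic_windows = [(8, 11_500, 8), (8, 15_000, 12), (8, 9_500, 15)]

total_queries = sum(h * q for h, q, _ in dynamic_windows)
avg_latency = sum(h * q * ms for h, q, ms in dynamic_windows) / total_queries
print(f"dynamic 24h average latency: {avg_latency:.1f} ms")   # prints ~11.5 ms, in line with the ~12ms above

# Monthly cost line items quoted above (storage, compute, bandwidth, migration).
static_cost = 100 + 500 + 50
dynamic_cost = 140 + 480 + 30 + 20
print(f"static ${static_cost}/mo vs dynamic ${dynamic_cost}/mo "
      f"({100 * (dynamic_cost - static_cost) / static_cost:+.0f}%)")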
Real-World Example: Cloudflare Workers with Durable Objects
Cloudflare’s architecture demonstrates dynamic compute-data placement[3].
Traditional model:
User in Tokyo queries application
Query routes to Tokyo edge server
Edge server calls centralized database in US
Total latency: 150-200ms (Tokyo → US → Tokyo)
Cloudflare Workers + Durable Objects:
User in Tokyo queries application
Query hits Tokyo edge server running Workers (compute)
Durable Object for this user lives in... where?
The gravity optimization:
Initially, Durable Object might be in US (created there)
System observes: 90% of queries come from Tokyo
System migrates Durable Object to Tokyo region
Now: Query hits Tokyo edge server, accesses local Durable Object
Total latency: 5-10ms (all local)
The key: Both compute (Workers) and data (Durable Objects) can move. The system migrates the Durable Object to follow the query pattern[3].
Cloudflare’s Implementation: Automatic Migration
Cloudflare’s system automatically migrates Durable Objects based on access patterns:
Telemetry collected:
Query frequency per region
Latency per region
Data size (inertia factor)
Migration logic:
IF 80%+ of queries come from region R
AND current location ≠ R
AND migration_cost < latency_savings_value
THEN migrate to region R
Migration process:
Detect pattern shift (sustained for 5+ minutes)
Allocate Durable Object in new region
Pause writes (brief lock, ~100ms)
Copy state to new location
Update routing (redirect to new location)
Resume writes
Delete old location
Downtime: ~100-500ms during migration
Result: Objects automatically follow users. European user’s shopping cart lives in EU. Asian user’s cart lives in APAC[3].
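The migration sequence above can be sketched as a small state handoff. This is not Cloudflare's implementation; the MigratableObject class and migrate function are hypothetical, but they show why the brief write pause keeps the copied state consistent before routing flips to the new region.

import threading, time

class MigratableObject:
    # Toy single-writer object whose state can be handed off between regions.
    def __init__(self, name, region, state):
        self.name = name
        self.region = region
        self.state = dict(state)
        self.write_lock = threading.Lock()   # held only during the copy window

    def write(self, key, value):
        with self.write_lock:                # writes block briefly while a migration copies state
            self.state[key] = value

def migrate(obj, new_region, routing_table):
    # Pause writes, copy state, update routing, resume at the new location.
    with obj.write_lock:                           # 1. pause writes (brief lock)
        snapshot = dict(obj.state)                 # 2. copy state with no racing writers
        replica = MigratableObject(obj.name, new_region, snapshot)
        routing_table[obj.name] = replica          # 3. redirect requests to the new copy
        obj.state = {}                             # 4. retire the old location
    return replica                                 # 5. writes resume against the replica

routing = {}
cart = MigratableObject("cart:alice", "us-east", {"items": 2})
routing[cart.name] = cart

start = time.time()
migrate(cart, "apac-tokyo", routing)
print(f"handoff took {(time.time() - start) * 1000:.2f} ms; "
      f"cart now served from {routing['cart:alice'].region}")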
The Anti-Pattern: Fighting Gravity
Many systems fight gravity instead of following it. This wastes resources and hurts performance.
Anti-Pattern 1: Centralized Data, Distributed Users
Startup begins with all data in us-east-1 (AWS default). Grows globally. European customers complain about latency.
Wrong solution: “Just use a CDN for static assets. Database queries are fast enough.”
Reality: Database queries from EU to US add 80-120ms. Users notice. Conversion rates drop.
Gravity-aware solution: Replicate EU users’ data to EU region. Partition by geography.
Anti-Pattern 2: Compute Pinned by Configuration
Infrastructure-as-code hardcodes compute regions:
# terraform.tfvars
region = "us-west-2"
Team never changes it. Even as user distribution shifts, compute stays in us-west-2.
Gravity-aware solution: Auto-scaling policies that deploy compute where demand is highest.
Anti-Pattern 3: Over-Replication
“Let’s replicate everything everywhere to minimize latency!”
Result: 10× write amplification, massive costs, marginal latency improvement for rarely-accessed data.
Gravity-aware solution: Replicate only hot data to high-demand regions. Cold data stays in primary region.
Quantifying the Waste: Static Placement Inefficiency
Let’s model a real-world scenario to quantify waste.
Application:
1TB dataset
1M users distributed: 40% US, 35% EU, 25% APAC
Workload: 100k queries/hour average
Static placement (all data in US):
Query latencies:
US queries (40k/hour): 5ms average
EU queries (35k/hour): 90ms average
APAC queries (25k/hour): 120ms average
Weighted average latency:
(40k × 5ms + 35k × 90ms + 25k × 120ms) / 100k = 63.5ms
Cost:
Storage: $200/month (1TB in US)
Compute: $1,000/month (US region)
Bandwidth: $300/month (cross-region queries)
Total: $1,500/month
Optimal dynamic placement:
Data placement (based on access patterns):
US: 1TB (full primary copy, including all cold data)
EU: 400GB (replica of the data EU users access most)
APAC: 250GB (replica of the data APAC users access most)
Compute placement:
US: 40% capacity
EU: 35% capacity
APAC: 25% capacity
Query latencies:
US queries: 5ms average (local)
EU queries: 8ms average (served from the local replica; ~4% fall back to the US primary at 90ms)
APAC queries: 10ms average (served from the local replica; ~5% fall back to the US primary at 120ms)
Weighted average latency:
(40k × 5ms + 35k × 8ms + 25k × 10ms) / 100k = 7.3ms
Cost:
Storage: $330/month (1.65TB total with replication)
Compute: $950/month (distributed, slight efficiency gains)
Bandwidth: $80/month (less cross-region traffic)
Migration: $40/month (continuous optimization)
Total: $1,400/month
Comparison:
Static placement:
Latency: 63.5ms
Cost: $1,500/month
Dynamic placement:
Latency: 7.3ms (8.7× faster)
Cost: $1,400/month (7% cheaper)
The waste: Static placement is both slower AND more expensive. It wastes:
88% of the potential latency improvement
7% unnecessary cost
Why?: Fighting gravity. Forcing 60% of queries to cross regions unnecessarily.
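The comparison can be reproduced with a few lines of arithmetic, using the latencies, query shares, and cost line items from the scenario above (the weighted_latency helper is just a sketch):

def weighted_latency(share_by_region, latency_ms):
    # Query-share-weighted average latency in milliseconds.
    return sum(share * latency_ms[region] for region, share in share_by_region.items())

shares = {"us": 0.40, "eu": 0.35, "apac": 0.25}     # 40k / 35k / 25k of 100k queries/hour

static_ms  = weighted_latency(shares, {"us": 5, "eu": 90, "apac": 120})
dynamic_ms = weighted_latency(shares, {"us": 5, "eu": 8,  "apac": 10})

static_cost  = 200 + 1000 + 300                     # storage + compute + bandwidth, $/month
dynamic_cost = 330 + 950 + 80 + 40                  # + continuous migration, $/month

print(f"static : {static_ms:.1f} ms at ${static_cost}/month")
print(f"dynamic: {dynamic_ms:.1f} ms at ${dynamic_cost}/month")
print(f"{static_ms / dynamic_ms:.1f}x faster, "
      f"{100 * (dynamic_cost - static_cost) / static_cost:+.0f}% cost")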
The Feedback Loop: Gravity Responds to Movement
Here’s where it gets interesting: when you move data or compute, you change the gravity field.
Example:
Initial state:
Data in US: 1TB
High query load from EU: 50k queries/hour
EU has high compute demand gravity
System considers: Should we replicate to EU?
If we replicate:
Data now in US (1TB) and EU (1TB replica)
EU queries become local: latency drops from roughly 85ms to 5ms
EU compute demand gravity decreases (queries satisfied locally)
US-EU bandwidth decreases
Write amplification increases (2× writes)
New equilibrium:
Lower latency overall
Higher storage cost
Lower bandwidth cost
System continuously monitors: Is this still optimal?
If EU query load drops (users churn, time zone shift):
EU compute demand gravity decreases further
System considers: Should we stop replicating to EU?
If yes, removes EU replica
New equilibrium with lower cost, acceptable latency
The insight: Gravity is not static. It’s a dynamic equilibrium that responds to placement decisions.
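To put a rough number on that feedback, we can plug the F_data formula from the mathematical model into the before/after states. Treating the demand term as the residual EU traffic still hitting the US copy, and assuming a ~5% replica-miss rate, are both simplifications made for illustration.

# Pull of the EU on the US copy, using F_data = demand / (size * distance).
size_gb, distance_km = 1_000, 7_500              # 1TB primary in the US; assumed US-EU distance

pull_before = 50_000 / (size_gb * distance_km)   # all 50k EU queries/hour hit the US copy
pull_after  = 2_500  / (size_gb * distance_km)   # assume ~5% of EU queries miss the new replica

print(f"EU pull on the US copy before replication: {pull_before:.5f}")
print(f"EU pull on the US copy after  replication: {pull_after:.5f}")   # far weaker pull: new equilibrium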
The Three Laws of Data Gravity
Drawing from our analysis, we can formulate three laws:
First Law (Newton’s First): Data and compute at rest stay at rest. Data and compute in motion stay in motion, unless acted upon by gravity.
Practical meaning: Static placement persists unless there’s a strong signal to change. Systems should have hysteresis (resistance to change) to avoid thrashing.
Second Law (Newton’s Second): The acceleration of an object is proportional to the force acting on it (here, demand) and inversely proportional to its mass.
Practical meaning: Light data with high demand moves easily. Heavy data with low demand stays put. Compute moves more easily than data (lower mass).
Third Law (Newton’s Third): For every movement of data toward compute, there’s an equal and opposite movement of compute toward data.
Practical meaning: Optimal systems balance both. Sometimes data moves to compute. Sometimes compute moves to data. Often, both move partially.
Implementation Pattern: The Gravity Orchestrator
How do you build a system that follows gravity?
Architecture components:
1. Telemetry Collector
Collect per-object metrics:
- access_frequency (queries/hour)
- query_sources (region breakdown)
- data_size (GB)
- last_accessed (timestamp)
- migration_history (previous locations)
2. Gravity Calculator
FOR EACH data_object:
FOR EACH region:
compute_gravity[region] = query_frequency[region] / distance[region]
max_gravity_region = argmax(compute_gravity)
current_region = data_object.location
IF max_gravity_region ≠ current_region:
improvement_score = compute_gravity[max_gravity_region] - compute_gravity[current_region]
migration_cost = data_size × bandwidth_cost + migration_downtime_cost
IF improvement_score > migration_cost × threshold:
schedule_migration(data_object, max_gravity_region)
3. Migration Executor
WHILE migration_queue not empty:
migration = pop_highest_priority(migration_queue)
IF in_maintenance_window() AND below_migration_rate_limit():
execute_migration(migration)
measure_impact(migration)
IF impact_positive():
log_success(migration)
ELSE:
rollback(migration)
block_similar_migrations_temporarily()
4. Feedback Monitor
FOR EACH completed_migration:
measure:
- latency_before vs latency_after
- cost_before vs cost_after
- query_pattern_changes
IF metrics_improved():
reinforce_migration_policy()
ELSE:
adjust_migration_threshold()
This is the skeleton of an Intelligent Data Plane—a system that continuously optimizes placement based on observed gravity.
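As a concrete, if simplified, rendering of the Gravity Calculator, here is a Python sketch. The constants, the TrackedObject type, and the choice to score a candidate region by demand-over-distance summed across all query origins are assumptions layered on the pseudocode above, not a production policy.

from dataclasses import dataclass, field

# Assumed constants; a real orchestrator would pull these from billing and SLOs.
BANDWIDTH_COST_PER_GB = 0.02     # $/GB moved between regions
MIGRATION_PENALTY = 5.0          # flat cost per migration (downtime, risk)
THRESHOLD = 1.5                  # demand improvement must clearly beat the cost

# Pairwise distances in km; same-region "distance" is a small local constant.
DIST = {
    ("us", "us"): 50,     ("us", "eu"): 7500,    ("us", "apac"): 9000,
    ("eu", "us"): 7500,   ("eu", "eu"): 50,      ("eu", "apac"): 9500,
    ("apac", "us"): 9000, ("apac", "eu"): 9500,  ("apac", "apac"): 50,
}

@dataclass
class TrackedObject:
    name: str
    location: str
    size_gb: float
    queries_per_hour: dict = field(default_factory=dict)   # origin region -> queries/hour

def region_gravity(obj, candidate):
    # Demand the candidate region could serve, weighted by inverse distance to each origin.
    return sum(qph / DIST[(origin, candidate)]
               for origin, qph in obj.queries_per_hour.items())

def plan_migrations(objects, regions):
    # Yield (object, target) pairs whose estimated benefit beats the migration cost.
    for obj in objects:
        gravity = {r: region_gravity(obj, r) for r in regions}
        target = max(gravity, key=gravity.get)
        if target == obj.location:
            continue
        improvement = gravity[target] - gravity[obj.location]
        cost = obj.size_gb * BANDWIDTH_COST_PER_GB + MIGRATION_PENALTY
        # A demand score and a dollar cost only become comparable through the
        # tunable threshold, mirroring the pseudocode above.
        if improvement > cost * THRESHOLD:
            yield obj, target

objs = [TrackedObject("sessions", "us", 20, {"us": 500, "eu": 9_000, "apac": 200})]
for obj, target in plan_migrations(objs, ["us", "eu", "apac"]):
    print(f"schedule migration: {obj.name} {obj.location} -> {target}")

The threshold doubles as hysteresis: small or short-lived shifts in demand never clear it, which is exactly the resistance to change the First Law above asks for.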
Looking Ahead: Predictive Gravity
Everything we’ve discussed so far is reactive. The system observes patterns, calculates gravity, and responds.
But what if we could predict gravity changes before they happen?
Scenario: Your application sees a regular daily pattern:
8 AM US time: US queries spike
4 PM US time: EU queries spike
12 AM US time: APAC queries spike
A reactive system waits for the spike, detects the pattern, then migrates data. By the time migration completes, the spike might be over.
A predictive system learns the pattern and migrates proactively:
7:45 AM: Predict US spike, pre-migrate compute to US
3:45 PM: Predict EU spike, pre-migrate EU user data to EU
11:45 PM: Predict APAC spike, pre-migrate APAC user data to APAC
The advantage: No window of degraded latency when the pattern shifts. Data is already where it needs to be.
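A reactive system can be upgraded to a predictive one with something as simple as an hourly profile learned from history. The sketch below is a toy (hypothetical names, no real scheduler): it averages past days' query counts per region per hour and schedules a pre-migration shortly before the predicted peak region changes. Chapter 11 develops much richer predictive models.

from collections import defaultdict

LEAD_MINUTES = 15   # act this far ahead of the predicted shift

def hourly_profile(history):
    # history: list of (hour_of_day, region, queries). Returns hour -> region -> average queries.
    totals, counts = defaultdict(float), defaultdict(int)
    for hour, region, queries in history:
        totals[(hour, region)] += queries
        counts[(hour, region)] += 1
    profile = defaultdict(dict)
    for (hour, region), total in totals.items():
        profile[hour][region] = total / counts[(hour, region)]
    return profile

def predicted_peak_region(profile, hour):
    return max(profile[hour], key=profile[hour].get)

def plan_premigrations(profile, current_region):
    # Emit (hour, target) actions whenever the predicted peak region changes.
    actions = []
    for hour in sorted(profile):
        target = predicted_peak_region(profile, hour)
        if target != current_region:
            actions.append((hour, target))   # schedule LEAD_MINUTES before this hour
            current_region = target
    return actions

# Toy history: two observed days with US mornings, EU afternoons, APAC nights.
history = []
for day in range(2):
    for hour in range(24):
        history += [(hour, "us", 10_000 if hour < 8 else 1_000),
                    (hour, "eu", 12_000 if 8 <= hour < 16 else 1_000),
                    (hour, "apac", 8_000 if hour >= 16 else 500)]

for hour, target in plan_premigrations(hourly_profile(history), current_region="us"):
    print(f"pre-migrate hot data and compute toward {target} {LEAD_MINUTES} min before {hour:02d}:00")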
This is the topic of Chapter 11: Vector Sharding and predictive data movement. We’ll explore algorithms that model data distribution as multidimensional vectors and predict optimal placement before demand materializes.
But the foundation is here: understanding that data and compute both have gravity, that gravity shifts dynamically, and that optimal systems follow gravity rather than fighting it.
Static placement wastes 30-60% of potential efficiency. Dynamic placement recovers that waste. Predictive placement takes it further.
References
[1] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
[2] D. McCrory, “Data Gravity: The Importance of Understanding the Implications,” Data Center Knowledge, 2010. [Online]. Available: https://www.datacenterknowledge.com/
[3] Cloudflare, “Durable Objects: Easy, Fast, Correct — Choose Three,” Cloudflare Blog, 2020. [Online]. Available: https://blog.cloudflare.com/durable-objects-easy-fast-correct-choose-three/
[4] A. Verma et al., “Large-scale Cluster Management at Google with Borg,” Proc. 10th European Conference on Computer Systems, pp. 1-17, 2015.
[5] M. Chowdhury et al., “Managing Data Transfers in Computer Clusters with Orchestra,” Proc. ACM SIGCOMM Conference, pp. 98-109, 2011.
[6] G. Ananthanarayanan et al., “Effective Straggler Mitigation: Attack of the Clones,” Proc. 10th USENIX Symposium on Networked Systems Design and Implementation, pp. 185-198, 2013.
[7] Netflix Technology Blog, “Active-Active for Multi-Regional Resiliency,” 2013. [Online]. Available: https://netflixtechblog.com/
Next in this series: Chapter 11 - Vector Sharding: Predictive Data Movement, where we’ll introduce algorithms that model data distribution as multidimensional vectors and predict optimal placement ahead of demand—the culmination of the Intelligent Data Plane vision.

