Database Sharding: A Deep Dive

What sharding actually is

Sharding is horizontal partitioning of data across multiple independent database instances. Each shard holds a disjoint subset of the rows, runs on its own machine (or cluster), and is queried independently. A user record either lives on shard 3 or shard 7, never both. This is fundamentally different from replication, where every node holds the same data for redundancy. Sharding is about splitting the data; replication is about copying it. Most production systems combine both: each shard is itself replicated for HA.

The reason sharding exists is brutally simple: a single machine has limits. You can scale vertically (bigger CPU, more RAM, faster NVMe) until the bill becomes absurd or the hardware simply doesn't exist at the size you need. Once your working set exceeds what fits in RAM, or your write throughput exceeds what one disk can sustain, or your single primary becomes a CPU bottleneck on query planning, you have to split. Sharding is the answer when read replicas alone can't save you — which is specifically the case for write-heavy or storage-heavy workloads, because replicas help reads but every replica still has to absorb every write.

When you actually need it (and when you don't)

The honest answer most senior engineers give: shard later than you think. A well-tuned Postgres on modern hardware can handle tens of TBs and tens of thousands of QPS. Adding sharding adds enormous operational complexity, so the rule of thumb is to exhaust vertical scaling, caching (Redis, materialized views), read replicas, partitioning within a single database (Postgres native partitioning, for instance), and query optimization before reaching for shards.

Real use cases where sharding becomes unavoidable: multi-tenant SaaS at very large scale (Salesforce, Shopify), social graphs and feeds (Facebook, Instagram), messaging (WhatsApp, Discord), financial ledgers with high write volumes, IoT/telemetry workloads writing billions of events per day, and gaming backends. The common thread is either huge data volume, very high write throughput, strict latency requirements at scale, or some combination.

Shard key selection — the most consequential decision

The shard key (or partition key) is the column or set of columns used to decide which shard a row lives on. This decision is borderline irreversible in practice. Changing the shard key later requires a full data migration across every shard, which on a multi-TB system can take weeks of careful engineering work. Choose carefully.

A good shard key has four properties. First, high cardinality — many distinct values. If you shard a SaaS app by country_code, you get at most ~250 distinct values, and most traffic will concentrate in a handful of countries. You can't have more shards than distinct key values, and uneven value frequency creates hotspots. Compare with user_id (millions of values) or tenant_id (thousands to millions in a big SaaS). High cardinality gives you room to grow and to distribute load.

Second, even distribution. Cardinality alone isn't enough — values must also be roughly uniformly distributed. created_at has infinite cardinality but terrible distribution: all today's writes land on whichever shard owns today's range. This is the classic time-based hotspot. The shard owning recent data gets hammered while older shards sit idle.

Third, query alignment. The shard key should match the access pattern. If 95% of your queries are "give me everything for user X," sharding by user_id means those queries hit exactly one shard — a single-shard query, cheap and fast. If you instead shard by order_id but always query by user_id, every read becomes a scatter-gather across all shards. That's a disaster for latency.

Fourth, stability. The key shouldn't change for a given logical entity. Sharding by email is risky because users change emails. Sharding by an immutable surrogate ID is safer.

The canonical examples: in a B2B SaaS, shard by tenant_id or account_id. In a consumer social app, shard by user_id. In a messaging app, shard by conversation_id (Discord went this route — they shard by guild_id because guild-scoped queries dominate). In an e-commerce app, sharding by customer_id works for the customer-facing side but creates pain for the merchant-facing side, which often wants to query across customers — many companies end up with two sharded clusters keyed differently for the two sides.

Sharding strategies

Range-based sharding assigns contiguous ranges to shards: users with IDs 1–1M on shard 1, 1M–2M on shard 2, etc. Easy to reason about, easy to do range scans, but prone to hotspots when IDs are monotonically increasing (newest users get all the action) and prone to imbalance when ranges fill up unevenly.

Hash-based sharding computes hash(key) % N to pick the shard. Distribution is excellent — hash functions spread values uniformly — but you lose the ability to do efficient range scans, because adjacent keys land on random shards. This is the most common approach for OLTP workloads.

Directory-based sharding keeps a lookup table mapping keys (or key ranges) to shards. Very flexible — you can move a specific tenant to a dedicated shard if it grows huge — but the directory itself becomes a critical, hot piece of infrastructure that must be cached aggressively and kept highly available. Vitess uses something like this. Many SaaS systems do this to handle "whale" tenants.

Geo-sharding routes by geography — EU users on EU shards, US on US shards. Useful for data residency (GDPR) and latency, but cross-region queries are expensive and cross-region transactions are painful.

Consistent hashing — why it matters

Naive hash sharding uses hash(key) % N. The problem: when N changes (you add or remove a shard), the modulo result changes for nearly every key, so almost every row needs to move. Adding one shard to a 10-shard cluster shouldn't require remapping 90% of your data, but with mod N it does.

Consistent hashing solves this. You imagine a circular hash space (say, 0 to 2^32). Each shard is placed at one or more points on the ring (typically many "virtual nodes" per physical shard, for distribution smoothness — usually 100–200 vnodes per shard). To find which shard owns a key, hash the key and walk clockwise to the next shard point. When you add a new shard, you place its vnodes on the ring, and only the keys that fall in the new vnodes' arcs need to move — roughly 1/N of the data instead of nearly all of it.

Cassandra, DynamoDB, Riak, and memcached clients all use consistent hashing under the hood. If you're rolling your own sharding layer, consistent hashing with virtual nodes is essentially mandatory for any system that will ever need to add or remove shards — which is to say, all of them.

Joins and complex queries — where pain lives

This is where sharding hurts the most, and it's the reason teams delay sharding as long as possible.

If both tables in a join are sharded by the same key, joins on that key stay local to a single shard. Sharding users and orders by user_id means JOIN users ON orders.user_id = users.id WHERE user_id = 42 is a single-shard query. Beautiful.

The moment your query touches a different key — say you need to join orders with products, where products are sharded by product_id — you have a cross-shard join. There's no efficient way to do this in a sharded SQL system. The options are all bad: scatter-gather (query every shard, do the join at the application/coordinator layer, hope the intermediate result fits in memory), denormalize aggressively (copy product attributes into the orders row to avoid the join), or replicate small dimension tables to every shard (works if they're truly small and rarely updated — country codes, product categories, etc.).

Aggregations across shards have the same problem. SELECT COUNT(*) FROM orders WHERE status = 'pending' has to ask every shard and sum results. SELECT user_id, SUM(amount) FROM orders GROUP BY user_id ORDER BY 2 DESC LIMIT 10 is much worse — you can't trust each shard's local top-10 to give you the global top-10, so you have to pull more data than you'd think.

Most companies handle this by maintaining a separate analytical store (a columnar warehouse — Snowflake, BigQuery, ClickHouse, Redshift) fed by CDC from the OLTP shards. The OLTP system answers single-shard transactional queries; the warehouse answers everything cross-cutting. This dual-store pattern is standard at scale.

Transactions are another casualty. Single-shard transactions work fine. Multi-shard transactions require two-phase commit or distributed transaction protocols (Spanner's TrueTime, Calvin, Percolator-style), all of which are slow and operationally heavy. Most production sharded systems either avoid multi-shard transactions entirely by careful data modeling, or accept eventual consistency for cross-shard operations.

Operational challenges in real life

Hotspots are the most common failure mode. One tenant grows 100x larger than the others. One key gets disproportionate writes (the "celebrity user" problem). Your shard distribution is statistically even, but one shard is melting while others are idle. Solutions involve splitting the hot key (sub-sharding within a tenant), salting the key (appending a random suffix to spread writes), or moving the hot tenant to its own dedicated shard via directory-based routing.

Rebalancing is the constant background tax. As data grows and shards fill, you need to move data around. With consistent hashing this is bounded, but it's still tons of network I/O, and you have to maintain correctness during the move — typically by dual-writing to old and new locations, backfilling, switching reads, and finally cutting writes. Vitess's VReplication and Citus's shard rebalancer automate much of this; if you're DIY, you're writing this yourself, and it is genuinely hard.

Schema migrations become an N-way coordination problem. Adding a column requires the migration to run on every shard, ideally without long locks. Online schema change tools (gh-ost, pt-online-schema-change) help but you still need to run them everywhere and verify consistency. A failed migration on one shard out of fifty is a special kind of nightmare.

Backups and restores scale linearly with the number of shards, and point-in-time consistency across shards requires careful snapshot coordination. Restoring a single shard at the wrong point in time relative to the others can produce a database that's internally inconsistent in ways that take days to debug.

Routing layer — something has to know which shard owns which key. Either the application embeds this logic (simple, but every service needs to be kept in sync), or you put a proxy in front (Vitess's vtgate, ProxySQL, custom routers). The proxy is now a critical, performance-sensitive component that needs its own HA story.

Observability gets harder. A slow query might be slow on only one shard. Tail latency across shards multiplies — if each shard has p99 of 10ms but a query fans out to 20 shards, the effective p99 is the max of 20 samples, which is much worse than 10ms.

What happens when shards grow

You'll cross several thresholds. The first is when a single shard hits its capacity (storage or IOPS or CPU). You split that shard into two. With consistent hashing, this is "add a new shard, redistribute about 1/(N+1) of the data, done." With range partitioning, you pick a midpoint and split the range. Either way, you need a live migration mechanism: dual-writes during cutover, verification, atomic switchover.

The second threshold is when the number of shards becomes operationally unwieldy. Managing 10 shards is fine. Managing 1,000 means you need automation for everything — provisioning, schema changes, monitoring, failover, capacity planning. Companies at this scale build internal control planes around tools like Vitess, Citus, or YugabyteDB, or move to natively distributed databases (Spanner, CockroachDB, TiDB) where the sharding is handled by the database itself.

The third threshold is when cross-shard query patterns evolve faster than your shard key allows. You sharded by user_id two years ago; now the product team wants features that aggregate across users in real time. This is where teams introduce CDC pipelines, search indexes (Elasticsearch), and OLAP stores — accepting that the sharded OLTP store can no longer answer all questions and a polyglot persistence layer is the way forward.

The fourth threshold, more rarely hit, is needing to change the shard key itself. This is the hardest migration in databases short of cross-database moves. You build a new sharded cluster with the new key, dual-write to both, backfill historical data, gradually migrate reads, verify, and eventually retire the old cluster. This is a multi-quarter project at any nontrivial scale.

Pros and cons, plainly

The pros are basically one big thing: near-linear horizontal scalability of writes, storage, and throughput. If your workload is well-aligned with the shard key, you can grow more or less indefinitely just by adding shards. You also get fault isolation — one shard going down only affects users on that shard, not the entire user base. You get geographic placement options. You get smaller per-shard datasets that are individually easier to back up, vacuum, reindex, and restore.

The cons are many and real: cross-shard queries are expensive or impossible, transactions across shards require expensive distributed protocols, joins force denormalization, schema changes are now N-way operations, rebalancing is constant operational work, the routing layer is a new critical dependency, observability is harder, shard key choice is nearly irreversible, hotspots will eventually happen, and you need significantly more senior engineering effort to operate the system than a single-instance database would require.

The practical mental model

Think of sharding as buying horizontal scale with the currency of operational complexity and query flexibility. The trade is worth it when you've genuinely outgrown single-node options. It's an expensive trade when you make it too early — you'll burn engineering years on infrastructure that solves a problem you don't yet have. The healthiest pattern at most companies is: scale vertically first, add read replicas, add caching, partition within a single DB, then introduce sharding only when there's no other way forward. And when you do shard, invest serious thought in the key, because that decision will outlive most of the team that made it.

TL;DR.

Sharding solves a real problem but creates many new ones. The shard key is your most important decision and the hardest to change. Delay sharding until you've exhausted vertical scaling, caching, and read replicas — and when you do shard, design for the access patterns you actually have, not the ones you imagine you'll have.

Database Sharding: A Deep Dive

What sharding actually is

When you actually need it (and when you don't)

Shard key selection — the most consequential decision

Sharding strategies

Consistent hashing — why it matters

Joins and complex queries — where pain lives

Operational challenges in real life

What happens when shards grow

Pros and cons, plainly

The practical mental model

TL;DR.

Comments

System Design

More from this blog

Velero: Kubernetes Backup & Restore Solution

K3d: Lightweight Kubernetes with K3s in Docker

Helmfile: The Complete Guide

K3s: The Lightweight Kubernetes for the Edge and IoT

Command Palette

What sharding actually is

When you actually need it (and when you don't)

Shard key selection — the most consequential decision

Sharding strategies

Consistent hashing — why it matters

Joins and complex queries — where pain lives

Operational challenges in real life

What happens when shards grow

Pros and cons, plainly

The practical mental model

TL;DR.

Comments

System Design

More from this blog