How Discord rebuilt message storage before MongoDB's single replica set hit its limit

At 100 million stored messages, a single MongoDB replica set stopped fitting in memory — and a year later, a near-empty channel almost took the new database down too.

DISCORD · 2017 · DATABASES / SCALE

System stress over time Breach at T+7

120M+ Messages/day at writing

12 nodes Cluster size

<1ms Write latency

<5ms Read latency

BASELINE

A Database Built for Speed, Not Scale

Discord shipped its first version in under two months in early 2015. For that kind of speed, MongoDB was the obvious pick — everything lived in one MongoDB replica set, with messages indexed by channel_id and created_at.

That was a deliberate trade, not an oversight. The team knew they weren’t going to lean on MongoDB’s sharding long-term — it was complicated and not known for stability — so they built with an easy path to migrate later. Ship fast now, get robust when it actually matters.

T+0 (NOV 2015)

The Climb

Around 100 million stored messages, the warning signs arrived on schedule: the data and its index no longer fit in RAM, and latencies turned unpredictable. It was time to move.

DECISION

Seven Requirements, One Database

The team wrote down what they actually needed: linear scalability without manual resharding, automatic failover, low maintenance, proven technology, predictable performance (alarms fired if the API’s p95 crossed 80ms), no blob-store deserialization tax, and open source.

MODELING

When 'It Can' Doesn't Mean 'It Should'

Cassandra’s compound key gave Discord a partition key (channel_id) and a clustering key. Message IDs on Discord are Snowflakes — chronologically sortable — so message_id became that clustering key, letting Cassandra range-scan a channel directly instead of guessing.

Then the import hit a wall: log warnings about partitions exceeding 100MB, despite Cassandra advertising support for 2GB partitions. Large partitions pile garbage-collection pressure onto compaction and cluster expansion, and they can’t be spread across the cluster. A single Discord channel can run for years — that partition only grows.

DARK LAUNCH

A Required Field, Suddenly Null

Discord dual-wrote to MongoDB and Cassandra in production before fully switching over. Almost immediately, the bug tracker filled with author_id showing up null — on a field that was supposed to be required.

That fix surfaced a second, quieter problem: deletes in Cassandra are themselves writes — “tombstones,” kept around for replication before being purged on compaction. Writing null generates a tombstone exactly like a delete does. With 16 columns in the schema but only 4 typically set per message, Discord was writing 12 needless tombstones per message. The fix: only write non-null values.

ROLLOUT

Flawless — For Six Months

Performance matched expectations exactly: sub-millisecond writes, sub-5-millisecond reads, consistent regardless of which data was being touched. Cassandra became the primary database within a week of launch, and MongoDB was phased out.

T+6MO

The Big Surprise

Six months in, Cassandra went unresponsive — 10-second “stop-the-world” garbage collection pauses, with no obvious cause. The trail led to one channel taking 20 seconds to load: the Puzzles & Dragons Subreddit’s public Discord, a server with exactly one visible message.

AFTERMATH

A Year Later

A year past the switch, Discord was running a 12-node cluster with a replication factor of 3, still just adding nodes as data grew. Despite the tombstone incident, total volume had grown from 100 million stored messages to over 120 million sent per day, with performance holding steady. The team — four backend engineers, no dedicated DevOps — called a low-maintenance database one of the project’s biggest wins.

Source — read the original

https://discord.com/blog/how-discord-stores-billions-of-messages

A plain-language, AI-drafted and human-edited retelling of the article published on discord.com, reorganized and explained in our own structure and words, with original analysis in the editor's note above. The facts, numbers, and decisions belong to the original author and are not altered. For the full depth, read the source.

← All systems