How We Test Replication with TLA+

← Back to blog

Distributed replication bugs are the most expensive bugs to debug in production: they usually depend on timing, retries, partitions, and reordering. Unit tests and integration tests are still essential, but they are examples. TLA+ lets us validate the protocol itself across many possible interleavings before those bugs hit real clusters.

What we model in kcore

In kcore, we model replication behavior in three TLA+ specs under specs/tla/:

ControllerNodeReconcile.tla for node/controller reconciliation behavior.
ControllerReplication.tla for single-DC event replication and merge semantics.
CrossDcReplication.tla for cross-DC anti-entropy and eventual convergence.

We run bounded model checking in CI, plus a trace bridge that validates Rust-generated replication traces against model invariants.

How TLA+ catches bugs that normal tests miss

TLA+ explores state transitions, not just expected scripts. That means it can surface bugs like:

Conflict paths that can leave a resource in non-terminal or manual-only states.
Convergence failures where replicas can diverge under certain delivery orders.
Broken compensation semantics where an auto-compensated conflict has no valid loser event.
Edge cases in anti-entropy where retries and dedupe interact unexpectedly.

A key practical benefit is speed of feedback: we fix the model and invariants before writing or finalizing implementation details, which reduces high-risk rework later.

How this fits with runtime testing

We treat this as layered verification:

Model level: bounded TLC checks for protocol safety/liveness properties.
Trace level: replication trace checks to prevent model/runtime drift.
Runtime level: controller replication tests and soak runs under retries.

This combination helps catch both design bugs and implementation regressions.

Who else uses TLA+

TLA+ is not an academic-only tool. Public engineering writeups describe production use by teams at:

Amazon (S3, DynamoDB, EBS and others) to find subtle distributed-system bugs early.
Microsoft (Azure teams) for protocol and system design validation.
Cockroach Labs for parts of distributed database protocol verification.

The pattern is consistent: model check critical protocols early, then keep models close to implementation as systems evolve.

Where to learn TLA+ quickly

Leslie Lamport's TLA+ page (official starting point).
Learn TLA+ (practical, beginner-friendly guide).
Learn TLA+ intro for first model-checking workflow.
Hillel Wayne articles for practical modeling style.
TLA+ conference talks for real-world case studies.

What this means for operators

The goal is simple: less scary failure behavior in real clusters. When peers reconnect after faults, replication should converge deterministically, and operational gates should clearly report health. TLA+ is one of the tools we use to keep that promise honest.