Skip to content

05 · Data & State: The System's Real Hard Part

The thesis in one line: logic is easy to change, data is hard. Get the code wrong and you change a line and redeploy; but when hundreds of millions of records already lie in the database in some structure, and users' state is shifting in real time, and you want to change it — that's deadly. The real hard part of architecture has never been logic, but data and state.


Why "state is the root of all evil, and the source of all value"

First, build the most important intuition: sort the things in a system into two kinds —

  • Stateless: it remembers nothing. You give it input, it computes output, and then forgets. Like a function converting Fahrenheit to Celsius, or a gateway that just forwards requests.
  • Stateful: it remembers things, and those memories change over time. Like "a user's account balance," "what's in the cart," "who's online in this room right now."

These two are worlds apart in "how easily they scale out":

   Stateless component: like a convenience-store cashier — anyone will do,
   add a few more when short
   ┌────┐ ┌────┐ ┌────┐ ┌────┐
   │inst│ │inst│ │inst│ │inst│   ← want to scale? just copy one, split traffic
   └────┘ └────┘ └────┘ └────┘       freely; they need know nothing of each other

   Stateful component: like the one unique manuscript in a library
   ┌──────────────────┐
   │ It "remembers"    │   ← want to copy it? then how do the two copies' "memories"
   │ things            │      stay consistent? Copying brings the sync problem —
   │(balance/inventory/│      and this is the root of all the trouble
   │ session)          │
   └──────────────────┘

Stateless components scale easily; stateful ones don't. This is one of the first principles of distributed systems.

So you'll see a recurring architectural move: find every way to squeeze "state" out of "computation," and gather it into a few dedicated places that manage state (databases, caches), keeping the rest of the components as stateless as possible so they can be freely copied. This way the bulk of the system can scale horizontally with ease, and only "the hard-to-handle state" is corralled into a few controllable spots to be tended with care.

But state is by no means all downside. In the end:

State is the root of all evil — every consistency problem, scaling problem, failure-recovery problem traces back to it. State is also the source of all value — users' data, orders, relationships, memories are exactly the entirety of a product's value.

The architect's job is not to eliminate state (you can't), but to manage state crisp and clear: which data goes where, how strong its consistency must be, how it scales, how it stays neither lost nor scrambled during failures. This chapter is about all that.


I. Where data lives: different "access shapes" call for different stores

The most common beginner mistake is "I know how to use one database, so I'll stuff everything into it." That's like using the same box for everything — for liquids, for screws, for clothes — bound to be awkward.

The right approach: first look at what this data "looks like and how it's accessed," then pick the store that fits it best. This is called "choosing storage for the access shape." Here are the most common store types in an architect's toolbox, each with a sentence or two on "what kind of data / access suits it":

Store typeBest-suited data / accessOne-line intuition
RelationalClear structure, complex relationships, core data needing transactions and strong consistency (users, orders, accounts, inventory)The default home for "money, ledgers, relationships"; when you need rigor, reach for it
DocumentFlexible structure, self-contained, fetched/stored whole by ID (an article, a config, a session record)A blob of "semi-structured JSON"; use it when fields change often
Key-value (KV)Dead-simple "get value by key," demanding extreme speed (sessions, counters, cache, features)A giant dictionary; what you want is lightning-fast reads and writes
ColumnarAnalytical aggregation over massive data (reports, computing "sales by region over the past year")Stored by column; suited to "scan a swath, compute a sum/average"
GraphData whose focus is the relationship network itself (social graph, recommendation, risk-control linkage, knowledge graph)"Who knows whom, who connects to whom"; the expert at querying relationship paths
VectorData retrieved by semantic similarity (text/image embeddings, for RAG, recommendation)"Find the one closest in meaning to this"; an ordinary database can't do it
Object storageLarge, immutable files fetched whole by ID (images, video, model weights, backups)A warehouse for massive large files; cheap, capacious, unchanging
Search engineData needing full-text search / complex filtering and sorting (product search, log search)"Type a keyword, fuzzily pull out a relevant bunch"
Time-seriesTimestamped, append-only, time-aggregated data (monitoring metrics, sensors, billing records)An ever-growing timeline; good at "compute trends over time windows"

Note: a slightly complex system almost certainly uses several stores at once. This isn't "impure" — it's precisely the mark of maturity, and it's called "polyglot persistence," i.e. "let each kind of data live in the home that suits it best."

Recall Section 7 of the AI chat product template: relational for users/billing, document for session history, a vector store for knowledge retrieval, object storage for model weights, time-series for usage records — within one product, five stores each do their job. This is a living textbook of "choosing storage for the access shape."

How to judge which to use? Don't memorize the table — ask yourself three questions:

  1. What's this data's "read/write shape"? High-frequency small reads/writes (KV), whole-store-whole-fetch (document/object), wide-range aggregation (columnar), or finding by relationship/semantics/keyword (graph/vector/search)?
  2. How strong a consistency does it need? Not a single cent can be wrong (relational, strong consistency), or a tiny bit off is fine (many KV and search scenarios tolerate lag)?
  3. How big will it grow? How frequent is the access? This decides whether you must consider, from the start, the replication/sharding covered later.

II. The consistency spectrum: from "not a hair off" to "consistent eventually"

"Consistency" is the most twisty and most crucial concept in the data world. First pull it from "black or white" out into a spectrum:

   Strong consistency ◀───────────────────────────────▶ Eventual consistency
   (right after a write, everyone reads          (after a write, a moment passes
    the newest value)                             before everyone sees the newest)

   "deduct money, balance is correct        "you tapped Like; others may see +1
    everywhere instantly"                    a few seconds later"
   rigorous, but costly, hard to scale,     loose, scales well, highly available,
   easily slows things down                 but has a "brief inconsistency window"
  • Strong consistency: after any write, all subsequent reads immediately read that newest value. The cost: to keep all replicas "in lockstep," you sacrifice either speed (wait for everyone to confirm) or availability (refuse service if you can't coordinate).
  • Eventual consistency: after a write, the replicas gradually sync over some span of time, eventually converging; but during that "moment" window, different places may read different values. In exchange you get high availability and easy scaling.

CAP in plain words

Unavoidable is the CAP theorem. It's often explained mysteriously, but the core is one plain sentence:

When the network fails and splits your system into two halves that can't reach each other (a partition, P), you can only pick one of two things:either keep serving but possibly hand out inconsistent data (pick availability, A), or refuse service so as not to err (pick consistency, C).

The key is: the network partition (P) isn't yours to choose — as long as it's a cross-machine distributed system, the network will eventually glitch, and a partition will surely happen. So CAP's real meaning isn't "pick two of three," but:

        When the network is fine: you can have both C and A, all is calm

              A network partition occurs (sooner or later)

              ┌──────────┴──────────┐
              ▼                     ▼
       Pick CP (consistency first)  Pick AP (availability first)
   "rather not serve than           "rather give stale data than
    give wrong data"                 stop serving"
    suits: money, ledgers,           suits: like counts, view counts,
    inventory                        activity feeds

Which data needs strong consistency, which can be eventually consistent?

This is a business judgment, not a technical one. There's only one test: how big a real consequence does a "brief inconsistency" of this data cause?

  • Must be strongly consistent (severe consequences, errors cost money/cause incidents):
    • 💰 Account balance, payments: over-deducting a cent, double-spending a transaction — each is an incident.
    • 📦 Inventory deduction: oversell and you've sold goods that don't exist, owing compensation and apology (this is e-commerce's lifeline).
    • 🔐 Uniqueness constraints: the same username can't be registered twice.
  • Can be eventually consistent (brief inconsistency nobody truly cares about):
    • 👍 Like counts, view counts: you see 1024, it's actually 1025 — who's harmed? Reconcile a few seconds later, fine.
    • 📰 Social activity feeds: the post you made, followers seeing it a few seconds late, totally acceptable.
    • 🔔 Notifications, read state: a little late does no harm.

Architectural wisdom: strong consistency is expensive — don't splurge it everywhere. Spend your precious strong-consistency "quota" where things truly go wrong (money, inventory), and use eventual consistency elsewhere to buy availability and scalability. A system that makes even "like counts" globally strongly consistent is paying a steep performance-and-availability price for rigor it doesn't need.


III. Transactions and ACID, and the BASE approach

When several things must "all succeed or all fail," you need a transaction. The classic example: a money transfer — "A minus 100" and "B plus 100" must both succeed or both fail, never just half (money vanishing into thin air or appearing from nowhere).

Traditional transactions (especially in a single-machine relational database) pursue the four ACID properties:

  • A — Atomicity: a group of operations is one indivisible whole; either all complete, or it's as if nothing happened (roll back on error).
  • C — Consistency: before and after the transaction, the data satisfies the prescribed rules (a balance won't go negative, etc.).
  • I — Isolation: when multiple transactions run concurrently, they don't interfere, as if queued one by one.
  • D — Durability: once committed successfully, the data lands permanently — a power loss won't lose it.

ACID is the "rigorous school" — it gives you strong consistency and strong guarantees. But in a distributed world spanning multiple services/databases, maintaining ACID is extremely expensive or even impossible (remember microservices' greatest pain in Chapter 04? cross-service transactions are nearly impossible).

So another approach arose — BASE, the "pragmatic school":

  • BA — Basically Available: the system as a whole stays available, even if locally degraded.
  • S — Soft State: data is allowed to exist in an "intermediate state," not required to be strictly consistent at every instant.
  • E — Eventual Consistency: after some time, the data will eventually reach consistency.
   ACID (rigorous school)              BASE (pragmatic school)
   "every step must be absolutely      "allow temporary imperfection, but
    correct"                            guarantee it reconciles in the end"
   strong consistency, strong          high availability, high scalability
   guarantees
   cost: hard to scale, may slow/refuse cost: must tolerate "intermediate states,"
   service                              logic harder to write
   suits: money, order core             suits: large scale, lag-tolerant scenarios

This isn't "which is more advanced," but "different sides taken before CAP": ACID leans CP, BASE leans AP. Mature systems often mix them: core transactions go ACID strong consistency, while peripheral stats, notifications, and streaming data go BASE eventual consistency. Once again it proves — there is no best, only the most fitting.


IV. The three great means of scaling data: replication, sharding, caching

When data volume and access volume climb and a single database can't hold up, you have three cards in hand. The key is: each card solves a different problem, and each has a different cost.

Means 1: Replication — mainly to "scale reads"

Make several identical copies of the data. The most common is "primary-replica": one primary handles writes, several replicas handle reads, and the primary's changes are continuously synced to the replicas.

                  write ┌────────┐
   Write request ─────▶ │ Primary │
                        └───┬────┘
                  sync↙     │   ↘sync
              ┌────────┐ ┌────────┐ ┌────────┐
   Read req ─▶│Replica 1│ │Replica 2│ │Replica 3│ ◀─── reads split across replicas
              └────────┘ └────────┘ └────────┘
  • Solves: the pressure of being read-heavy (the vast majority of systems read far more than they write). Read requests are spread across a pile of replicas, and read capacity scales horizontally with the number of replicas. It also provides redundancy as a bonus (if the primary dies, a replica can take over, raising availability).
  • Cost: ① primary-replica sync has lag, so a replica's data may be a touch staler than the primary's (eventual consistency again! read a replica right after a write and you may read a stale value); ② write capacity isn't scaled — all writes still press on that one primary; ③ the primary remains the single point for writes, and failover has its complexity.

Means 2: Sharding / partitioning — mainly to "scale writes"

When writes also can't hold up, or the data is too big to fit on one machine, you shard: by some rule (e.g. user ID), slice the data horizontally into many parts spread across multiple machines, each storing only a portion and handling the reads/writes of only that portion.

   Shard by "user ID":
   Users 0–999     ──▶ ┌─────────┐  Shard A (independent machine, own reads/writes)
   Users 1000–1999 ──▶ ┌─────────┐  Shard B
   Users 2000–2999 ──▶ ┌─────────┐  Shard C
   ...                            each shard minds only its own pile,
                                  write pressure divided up
  • Solves: the horizontal scaling of write capacity and storage capacity — something replication can't do. Ten shards can, in theory, withstand ten times the writes.
  • Cost (sharding is a "heavy weapon," the cost is large):
    • Cross-shard operations become extremely hard: want "the top 10 spenders across all users"? The data is scattered across shards — either query each and aggregate, or you simply can't. Cross-shard transactions and joins are basically gone.
    • A wrong shard key is a catastrophe: pick an unevenly distributed key and you get a "hot shard" — one machine hammered while the rest sit idle (e.g. shard by region but 80% of users are in one region).
    • Re-sharding (scaling) is painful: the data is already laid out by the old rule, and adding machines or changing the rule means migrating a massive amount of data — extremely thorny in engineering terms.
  • Therefore: shard as late as possible, and only after thinking it through, especially the choice of shard key, which is nearly a decision of "once set, very hard to reverse."

Means 3: Caching — for "scaling reads + cutting latency"

Put copies of hot data in a place closer to the user and faster to read (usually an in-memory KV), so that the mass of repeated read requests don't have to hit the database every time.

                  ┌─────────┐  hit (fast!)
   Read request ─▶│  Cache   │────────────────▶ return directly
                  │(in-memory│
                  │  level)  │
                  └────┬────┘
                       │ miss

                  ┌─────────┐
                  │Database  │  ─── backfill into cache, hits next time
                  └─────────┘
  • Solves: cutting read latency + relieving database read pressure. This is a high-ROI move, and nearly the first reflex when "reads can't hold up."
  • Cost: ① a consistency problem — the cache holds a copy, so when the database changes, how does the cache update in time? This is the big problem we'll cover next; ② it introduces a new component to maintain; ③ during a cache "cold start" (just launched, cache empty), all requests slam the database at once.

One line to tell the three cards apart: replication scales "reads," sharding scales "writes," caching "cuts latency + scales reads." They're often used together, but solve different dimensions — don't expect adding replicas to fix a write bottleneck; that takes sharding.


V. The three classic problems of caching

Caching's benefits are huge, but it's a classic case of "trading consistency for speed," with three pits nearly everyone falls into.

Problem 1: Cache invalidation and consistency (how do cache and database reconcile?)

The database's data changed, but the cache still holds the old value, so the user reads dirty data. What to do? A common practice is "after updating the database, delete the cache" (so the next read reloads the newest value from the database). But under high concurrency, the timing between "update the database" and "delete the cache" gives rise to all kinds of subtle inconsistencies.

There's no perfect solution here, only trade-offs. The core realization: the moment you use a cache, you've actively chosen to "accept some degree of inconsistency." Your job is to keep the inconsistency window and probability within what the business can accept — e.g. set a reasonable expiration (TTL) on the cache, so dirty data lives at most that long.

Problem 2: Cache penetration / breakdown / avalanche (the cache didn't block it, and the flood slams the database)

The cache's mission is to block traffic for the database. The moment this wall "leaks" at some instant, the flood slams straight into the database and may crush it in a moment:

  • Penetration: a mass of requests query data that doesn't exist at all (e.g. a non-existent user ID). It's not in the cache, so each one passes through to the database. Often exploited maliciously. → Counter: cache the "no such item" result too (null caching), or use a Bloom filter as a first gate.
  • Breakdown: at the very instant a hot key happens to expire, a mass of requests all find the cache empty and rush the database together to rebuild it. → Counter: lock during rebuild, letting only one request query the database while the rest wait for it to backfill.
  • Avalanche: a mass of keys all expire at the same instant (or the cache goes down entirely), and all traffic presses onto the database at once. → Counter: add random jitter to expiration so they don't expire together; and make the cache itself highly available.
   Normal:  traffic ──▶ [cache wall blocks most] ──▶ a trickle leaks ──▶ DB (relaxed)
   Trouble: traffic ──▶ [wall suddenly leaks/collapses] ──────────────▶ DB (crushed)
            ↑                                ↑
       penetration/breakdown/avalanche   DB can't hold up and crashes

Problem 3: Cache-database consistency (the combined manifestation)

Both prior problems ultimately point to the same thing: the cache is the data's "second copy," and wherever there's a copy, there's the risk of "two copies not reconciling." This is essentially the same class of problem as "primary-replica replication has lag" and "after sharding, cross-shard inconsistency" — once you copy data for performance/scaling, consistency becomes a cost you must manage.

Remember this thread that runs through the whole chapter: replication (replicas), sharding, and caching all "manufacture data copies/avatars" to buy scalability, and the cost is the same one — consistency gets harder. There's no free scaling in this world.


VI. Why the data model is the "hardest to change" decision

By now, it's time to return to the opening line and make it fully clear: logic is easy to change, data and state are hard.

  • Logic (code) is "stateless": got it wrong? Change a few lines, run the tests, redeploy, done in minutes. The old code vanishes cleanly, leaving no trace.
  • Data is "stateful," with inertia: once your data model (how entities are split, how relationships are built, what shard key, strong or eventual consistency) is set, a massive amount of real data lies there in its shape, and grows every day.

To change it means:

   Change code:  old code ──delete──▶ new code      (clean, reversible, minutes)
   Change data:  hundreds of millions of old records ──?──▶ new structure

            Must: ① design a compatible plan (how old and new coexist)
                  ② write migration scripts to move/transform the massive data
                     (may run for days)
                  ③ the system must keep serving meanwhile, can't stop
                  ④ on error, be able to roll back — but data may already be
                     half-changed
            (painful, dangerous, measured in "weeks" or even "months")

Data migration is widely acknowledged as one of the most high-risk operations in engineering: it's slow, it's dangerous, it's often irreversible. The more foundational the data decision (shard key, core entity relationships, consistency level), the more its change cost rises exponentially.

So, the heaviest piece of advice in this chapter:

The data model and state management deserve care, and to be thought through as early as possible. In the "needs → constraints → quality attributes → trade-offs" process (02 thinking framework), "how to model the data, where to put it, how strong a consistency, how to scale" should be part of what you think about earliest and most seriously.

This isn't telling you to "over-engineer" by borrowing against the future — it's saying that for decisions with extremely high change costs (especially the data model), it's worth spending more time thinking on the whiteboard, because once they're wrong, the cost of fixing them afterward may be a hundred times that of writing code. Be lazy where you can (logic can be rough first, refined later), be exacting where you must (the data model must be thought through early).


📌 Real-world cases: eventual consistency, beginning with a paper

"Eventual consistency" isn't an excuse for laziness — it's an engineering choice Amazon systematically proposed in the 2007 Dynamo paper: for data like the shopping cart, availability matters more than strong consistency (better to let you see a stale cart than to fail to add to the cart); while a bank balance must be strongly consistent. The same company, choosing consistency by tiering its data — exactly this chapter's core thesis.

  • 📎 Dynamo paper walkthrough
  • And what consistency is "claimed" versus "actually achieved" are often two different things: Jepsen (Kyle Kingsbury's project) specializes in fault-injection testing of various databases in practice, and has found consistency violations in over 20 systems. The lesson: don't blindly trust any "we're strongly consistent" marketing — check whether it passed Jepsen.

Chapter summary

  • The core thesis: logic is easy to change, data and state are hard; stateless is easy to scale, stateful is hard. State is the root of all evil (the root of every consistency/scaling/recovery problem) and the source of all value (a product's value is in its data). The real craft of architecture lies in managing state well.
  • Where data lives: choose storage for the "access shape" — relational (money/ledger/relationships, strong consistency), document (flexible whole-fetch), KV (extreme-speed small reads/writes), columnar (massive aggregation), graph (relationship networks), vector (semantic retrieval), object (large files), search (full-text search), time-series (time aggregation). A complex system naturally mixes several stores.
  • Consistency is a spectrum: from strong consistency (expensive, rigorous, for money and inventory) to eventual consistency (cheap, highly available, for like counts and activity feeds). CAP in plain words: a network partition happens sooner or later, and when it does, you can only pick one of "consistent" and "available."
  • Transactions and ACID (rigorous school, leans CP) vs BASE (pragmatic school, leans AP): mature systems mix them — core transactions strongly consistent, peripheral data eventually consistent.
  • Three cards for scaling data: replication scales reads, sharding scales writes, caching cuts latency + scales reads — but all three rely on "manufacturing copies," and the cost is all "consistency gets harder." Caching also has the three pits of invalidation consistency and penetration/breakdown/avalanche.
  • The data model is the hardest decision to migrate — be sure to think it through early.

Bridging forward: this chapter we kept doing one thing — sacrificing a bit of "consistency" for the sake of "scalability / high availability." This is, in fact, one trade-off after another. Consistency, performance, availability, cost… these "quality attributes" conflict with one another, forcing you to choose. The next chapter, 06 · Quality Attributes & Trade-offs, puts consistency onto this larger trade-off chessboard, so you see clearly: you can never have it all.