The Mystery of the Ghost Transaction: What I Learned After 48 Hours of Debugging Distributed Race Conditions

Tracking down a non-deterministic bug taught me that 'eventual consistency' is more than just a buzzword.

It was 3 AM, and a single customer account in production was flickering like a dying lightbulb. One second the balance was $500; the next, it was $0, only to revert to $500 a heartbeat later after a simple page refresh. I felt like the database was gaslighting me.

The Architecture of a Lie

On paper, our setup was standard. We had a Wallet Service written in Node.js, a PostgreSQL primary for the "source of truth," and a Redis cluster for high-speed balance lookups. To keep things performant, we used the Cache-Aside pattern. When a transaction occurred, we updated the DB, deleted the Redis key, and let the next read repopulate the cache.

It looks harmless in a flowchart. In production, under the weight of 10,000 requests per second, it became a crime scene.

The "Ghost" in the Code

The bug manifested during peak traffic. A user would trigger a "Withdraw" action, the money would leave their account, and then—for a window of about 200 milliseconds—the money would magically reappear. If they were fast enough (or had a script), they could double-spend.

Here was the logic we thought was "safe":

async function withdraw(userId: string, amount: number) {
  // 1. Update the database
  await db.transaction(async (tx) => {
    const user = await tx.users.findOne(userId)
    if (user.balance < amount) throw new Error('Insufficient funds')

    await tx.users.update(userId, {
      balance: user.balance - amount,
    })
  })

  // 2. Invalidate the cache
  await redis.del(`balance:${userId}`)
}

async function getBalance(userId: string) {
  // 1. Try cache
  const cached = await redis.get(`balance:${userId}`)
  if (cached) return Number(cached)

  // 2. Fallback to DB (Read from a Replica for 'scaling')
  const user = await dbReplica.users.findOne(userId)

  // 3. Repopulate cache
  await redis.setex(`balance:${userId}`, 3600, user.balance)
  return user.balance
}

Do you see it? I didn't. Not for the first twelve hours, anyway.

The Anatomy of the Race Condition

The culprit wasn't a single line of code, but the space _between_ the lines. We were running a primary database for writes and several read-replicas to handle the GET traffic.

Here is exactly how the ghost was born:

  1. Thread A (Withdrawal): Updates the Primary DB. The balance is now $0.
  2. Thread A: Sends the redis.del command. The cache is now empty.
  3. Thread B (Balance Check): Sees the cache is empty. It queries the Read-Replica.
  4. The Catch: The Read-Replica is lagging by 50ms. It hasn't seen Thread A's update yet. It still thinks the balance is $500.
  5. Thread B: Takes that stale $500 and writes it back into Redis.
  6. The Result: The cache is now "poisoned" with the stale $500 until something evicts it, which in our case meant the one-hour TTL on that setex call.

The transaction had happened, but the system had "forgotten" it. It was a ghost transaction—invisible to the cache despite being etched in the primary DB.
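
If the interleaving is hard to hold in your head, here is a toy reproduction of it. Every name below is made up for illustration: in-memory variables stand in for the primary, the lagging replica, and Redis.

// Toy reproduction of the race, not production code
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms))

let primaryBalance = 500
let replicaBalance = 500 // trails the primary by ~50ms
let cache: number | null = 500

async function withdrawSim(amount: number) {
  primaryBalance -= amount // 1. the write hits the primary immediately
  sleep(50).then(() => (replicaBalance = primaryBalance)) // replication lag
  cache = null // 2. invalidate the cache
}

async function getBalanceSim(): Promise<number> {
  if (cache !== null) return cache // 1. cache hit
  const fromReplica = replicaBalance // 2. cache miss: ask the lagging replica
  cache = fromReplica // 3. repopulate the cache with whatever it said
  return fromReplica
}

async function main() {
  await withdrawSim(500) // Thread A: the primary now says $0
  console.log(await getBalanceSim()) // Thread B, moments later: prints 500
  await sleep(100) // the replica has caught up by now...
  console.log(await getBalanceSim()) // ...but this still prints 500
}

main()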

Why "Just Use a Distributed Lock" Isn't Enough

The knee-jerk reaction is to throw a Redlock or a mutex around the whole thing. But locking the entire withdraw and getBalance flow globally is a great way to turn your high-performance microservice into a very expensive sequential processor.
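
If you're curious what that knee-jerk fix looks like, here is a sketch. acquireLock is a toy stand-in (an ioredis-style SET NX spin lock), not real Redlock, and the release below doesn't even check who owns the lock:

async function acquireLock(key: string, ttlMs: number) {
  // Spin until the lock key is free, then claim it with a TTL
  while ((await redis.set(key, 'locked', 'PX', ttlMs, 'NX')) !== 'OK') {
    await new Promise((resolve) => setTimeout(resolve, 10))
  }
}

async function getBalanceLocked(userId: string) {
  await acquireLock(`lock:wallet:${userId}`, 1000)
  try {
    // Every balance read now queues behind in-flight withdrawals on the same
    // wallet, and pays a lock round-trip even on a cache hit
    return await getBalance(userId)
  } finally {
    await redis.del(`lock:wallet:${userId}`) // naive release, no fencing token
  }
}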

We needed a way to ensure that the cache wasn't populated with stale data _without_ killing our throughput.

The Fix: Defensive Invalidation

Instead of just deleting the key, we shifted to a "Delete and Delay" strategy (sometimes called the Delayed Double Deletion). But even that felt like a hack.
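
For the record, that variant looked roughly like this, reusing the withdraw() from earlier. The 500 ms figure is just a guess at the worst-case replica lag, which is exactly why it never sat right:

async function withdrawWithDelayedDoubleDelete(userId: string, amount: number) {
  // DB write plus the first cache delete, exactly as before
  await withdraw(userId, amount)

  // Delete again once the replicas have (hopefully) caught up, evicting any
  // stale value a concurrent getBalance() wrote back in the meantime
  setTimeout(() => {
    redis.del(`balance:${userId}`).catch(() => {
      // best-effort: if this fails, the TTL is the only thing saving us
    })
  }, 500)
}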

The real solution involved embracing Version Vectors or Write-Through timestamps. We modified the cache to store not just the value, but the LSN (Log Sequence Number) or a simple version counter from the database.

If the cache population logic attempted to write a version that was _older_ than what the service had recently seen, the write was rejected.

async function getBalanceSafe(userId: string, minVersion?: number) {
  const data = await redis.get(`balance:${userId}`)

  // Only parse on a cache hit, and only trust it if it is at least as fresh
  // as any version this request has already seen
  if (data) {
    const { balance, version } = JSON.parse(data)
    if (!minVersion || version >= minVersion) {
      return balance
    }
  }

  // Force read from Primary if replica might be stale
  const user = await dbPrimary.users.findOne(userId)

  // Use Lua script in Redis to perform "Set if Version Higher"
  await redis.eval(
    compareAndSetScript,
    1,
    `balance:${userId}`,
    user.balance,
    user.version
  )

  return user.balance
}
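
The compareAndSetScript itself isn't shown above, but one possible shape of it looks like this. It assumes the cached value is a JSON blob holding both balance and version (matching what getBalanceSafe parses), and it refuses to overwrite a newer version with an older one:

const compareAndSetScript = `
  local current = redis.call('GET', KEYS[1])
  if current then
    local decoded = cjson.decode(current)
    if decoded.version and tonumber(decoded.version) >= tonumber(ARGV[2]) then
      return 0 -- the cached entry is as new or newer; reject the stale write
    end
  end
  redis.call('SET', KEYS[1],
    cjson.encode({ balance = tonumber(ARGV[1]), version = tonumber(ARGV[2]) }),
    'EX', 3600)
  return 1
`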

Lessons from the Trenches

  1. Replicas are eventually consistent, whether you like it or not. If you write to a Primary and immediately read from a Replica, you are gambling.
  2. The "Cache-Aside" pattern is a lie in high-concurrency systems. Without versioning or strict sequencing, your cache is just a fancy place to store incorrect data.
  3. Observability is your only hope. We only caught this because we started logging X-DB-Node headers and comparing them against the cache state in our ELK stack.

Distributed systems are hard because they break in the gaps between your mental model and the physical reality of networks and disk I/O. Sometimes, the fix isn't more code—it's admitting that "now" is a relative term in a cluster.