
Defense in Depth for Real-Time Rails

In the previous post, I fixed a race condition between Turbo and ActionCable that caused connected clients to diverge during rapid interactions. The fix was straightforward: bypass Turbo’s form submission lifecycle for actions that broadcast via ActionCable, and return head :no_content instead of redirecting. One update path instead of two. No more interference.

That fixed the loudest problem. But the investigation didn’t stop there.

As I traced the behavior through every layer–server concurrency, broadcast delivery, WebSocket lifecycle, client-side state management–I found issues at each one. A read-then-write race in the mutation method that could silently merge concurrent adjustments. A polling interval tuned for low database load rather than real-time responsiveness. A WebSocket disconnection handler that did literally nothing, silently dropping any broadcasts missed during a network blip. And fixing the Turbo lifecycle issue inadvertently resolved a separate class of bugs–phantom sound effects–that I hadn’t fully diagnosed yet.

No single fix would have been sufficient. The Turbo race from the previous post was the most visible failure mode, but each of these quieter issues could independently cause clients to show stale state under the right conditions. The fixes work as layers, each one addressing a different failure mode.

The specific bugs were in a poker timer, but the patterns are general to any Rails app doing real-time sync with ActionCable. This post is less of a debugging story and more of a hardening guide–four fixes, walking from server to client, that made the feature reliable under production conditions.

Pessimistic Locking

The timer’s adjust_time! method had a concurrency bug hiding in plain sight. It read the current state, computed a new value, and wrote it back–without holding a database lock during the read:

def adjust_time!(minutes)
  elapsed_seconds = Time.current - started_at
  current_computed_time = [current_time_remaining - elapsed_seconds, 0].max
  new_time = current_computed_time + (minutes * 60)

  update!(current_time_remaining: new_time, started_at: Time.current)
  broadcast_time_adjusted
end

The timer record is loaded by a before_action at the start of the request (@timer = @event.timer). Each Puma thread gets its own copy of the record with whatever values were in the database at that moment.
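For context, the loading path looks roughly like this; the controller and callback names below are placeholders, and only the @timer = @event.timer assignment comes from the real code:

class TimersController < ApplicationController  # placeholder name
  before_action :set_timer

  private

  def set_timer
    # @event is assumed to be loaded by an earlier callback.
    # Each Puma thread gets its own in-memory copy of the timer row here.
    @timer = @event.timer
  end
end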

If two requests arrive close together, both threads read the same starting state, both compute the same result, and both write it–producing one effective adjustment from two requests.

This is a time-of-check-to-time-of-use race, sometimes called TOCTOU: the state you checked is no longer the state you’re acting on by the time your write lands. This race exists regardless of your database–PostgreSQL, MySQL, and SQLite are all vulnerable when the application reads state in one step and writes in another without holding a lock.

My app uses SQLite, whose single-writer serialization prevents the writes themselves from interleaving, but that doesn’t help–the damage is done in the reads, which happen before either thread tries to write.
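To make the window concrete, here is a minimal sketch of the race; the Timer model name, the id, and the starting value are hypothetical, and elapsed time is ignored for simplicity:

# Two threads each load their own copy of the row before either one writes.
timer_a = Timer.find(timer_id)  # reads current_time_remaining = 1200
timer_b = Timer.find(timer_id)  # reads current_time_remaining = 1200

[
  Thread.new { timer_a.adjust_time!(-5) },  # computes 1200 - 300, writes 900
  Thread.new { timer_b.adjust_time!(-5) },  # also computes 1200 - 300, writes 900
].each(&:join)

Timer.find(timer_id).current_time_remaining  # => ~900, not ~600: one click was lost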

In development, this race never triggered. With a local server responding in single-digit milliseconds, Turbo’s button-disable behavior serialized the requests–each response arrived and re-enabled the button before the next click could fire. Requests never overlapped.

Here’s the irony: the data-turbo="false" fix from the previous post actually made this race more likely, not less. With Turbo bypassed, the browser sends standard form submissions with no Navigator lifecycle managing them. Each click fires an independent HTTP request immediately, with no button-disable serialization.

On a connection with 50-100ms of latency, three rapid clicks easily produce overlapping requests. The Turbo fix solved the client-side race but opened the door to a server-side one. Defense in depth means fixing both.

The fix is pessimistic locking–acquire an exclusive row lock before reading, so no other thread can read stale state. Active Record’s with_lock method wraps the block in a transaction and acquires an exclusive row lock–equivalent to transaction { lock!; ... }:

def adjust_time!(minutes)
  with_lock do
    reload  # re-read current state inside the lock
    elapsed_seconds = Time.current - started_at
    current_computed_time = [current_time_remaining - elapsed_seconds, 0].max
    new_time = current_computed_time + (minutes * 60)

    update!(current_time_remaining: new_time, started_at: Time.current)
  end
  broadcast_time_adjusted
  true
end

Two details matter here.

First, reload inside the lock is critical. Without it, with_lock acquires the exclusive lock but the method still operates on the stale in-memory attributes loaded at the start of the request. The lock alone doesn’t refresh them.

Second, broadcast_time_adjusted is outside the with_lock block–broadcasting doesn’t need the lock, and keeping the critical section small avoids holding it longer than necessary.

This pattern applies to any mutation method that reads current state, computes a new value, and writes it back. It’s distinct from idempotency–each timer adjustment is intentionally non-idempotent (clicking “-5 minutes” three times should subtract 15 minutes), so you can’t solve this by making the operation safe to retry. You need the lock to ensure each read sees the result of the previous write.

Tuning Solid Cable for Real-Time Use

Solid Cable is the database-backed ActionCable adapter that ships as the default in Rails 8. Instead of Redis pub/sub, broadcasts are inserted as rows into a cable messages table, and a listener thread polls for new messages at a configurable interval. That interval determines the maximum delay between a broadcast being written and a client receiving it.

Mine was set to 0.5 seconds–a reasonable default that keeps database load low. For most ActionCable use cases (chat messages, notifications, dashboard updates), half a second of delivery latency is fine. For a real-time feature where users expect instant feedback from rapid interactions, it’s not.

The polling interval also affects batching. When multiple broadcasts are written within a single poll cycle, they’re all picked up and delivered together. This isn’t a problem by itself–the client processes them sequentially in order.

But it widens the timing window for the kind of race conditions I described above. A broadcast that would have been delivered immediately with push-based pub/sub instead sits in the database for up to 500 milliseconds, during which other requests are being processed, other broadcasts are being written, and the client’s state may be shifting underneath.

This wasn’t the cause of the Turbo race condition from the previous post–that bug exists regardless of your ActionCable adapter. But the polling interval amplified it. Narrower delivery windows mean less opportunity for timing-dependent failures.

The fix is a one-line config change:

# config/cable.yml
production:
  adapter: solid_cable
  polling_interval: 0.1.seconds  # was 0.5.seconds

The trade-off is straightforward: 5x more frequent polling means 5x more database reads. For a small-scale app on SQLite, this is negligible. At scale, you’d want to benchmark it.

But the general principle holds: adapter defaults are tuned for broad compatibility, not for your specific latency requirements. If you’re building a feature that depends on fast broadcast delivery, check what your adapter is actually doing.
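A quick way to see the resolved settings is from a Rails console; config_for reads config/cable.yml for the current environment (the exact keys and value types depend on your file and Rails version):

# Inspect the cable configuration Rails actually loaded for this environment.
Rails.application.config_for(:cable)
# e.g. adapter: "solid_cable", polling_interval: "0.1.seconds"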

(This only applies to polling-based adapters like Solid Cable. If you’re using Redis-backed ActionCable, pub/sub delivers messages immediately and there’s no polling interval to tune.)

The Phantom Sounds

While I was chasing the timer sync issue, a separate symptom surfaced: brief phantom sound effects during rapid adjustments. The timer has an audio alert that plays when the blind level advances–a double beep. This sound was firing during interval adjustments on timers that weren’t even running.

Users would hear the start of a beep, cut short almost immediately, as if something triggered the audio and then killed it. This one had me stumped for a while–I initially suspected some kind of audio race condition or localStorage corruption.

Turns out, two bugs were interacting to produce it.

The first bug was spurious broadcasts. The timer is implemented as a Rails engine mounted in a host application. The host app had a before_action that called sync_with_client_state! on every timer request–including adjustment actions.

The engine’s adjustment methods explicitly avoided this call. The code comments said as much: “We intentionally do NOT call sync_with_client_state! here.” The engine treats the client as the source of truth for running timers–the client advances blind levels autonomously, and the server catches up at defined sync points. By adding sync_with_client_state! to a blanket before_action, the host app overrode that design.

The consequence: when sync_with_client_state! ran during an adjustment request, it compared the server’s database state to the computed client state. For a running timer, the server is typically behind–that’s by design. So the sync method “caught up” by advancing blind levels and broadcasting blinds_increase to all clients. These broadcasts were spurious. The blinds hadn’t actually advanced; the server was just reconciling a gap that the engine was designed to tolerate.

The fix was a one-line deletion: remove sync_with_client_state! from the host’s before_action. The engine already calls it at the right times–on page load, on channel subscription, and via a periodic sync job.
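The deletion looked roughly like this; the host controller is sketched, and only the sync_with_client_state! call comes from the real code:

class HostTimersController < ApplicationController  # placeholder for the host-app controller
  # Removed: this ran on every timer request, including adjustments, and
  # produced the spurious blinds_increase broadcasts.
  # before_action { @timer.sync_with_client_state! }
end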

The lesson generalizes: when a library comments “we intentionally do NOT do X here,” that’s a design constraint, not a suggestion. Understand the reasoning before overriding it.

The second bug was stale Stimulus controller state during Turbo page replacement. When a spurious blinds_increase broadcast arrived at a client, it triggered the audio alert path. But Turbo was simultaneously replacing the page (following the redirect from the adjustment action).

During page replacement, there’s a window where the old Stimulus controller is still connected–it still has its ActionCable subscription, it still receives broadcasts, and it still acts on them. The old controller received the blinds_increase broadcast and started playing the alert sound. Then Turbo finished the page swap, disconnect() fired on the old controller, and the audio was paused mid-play–producing the abbreviated beep the user heard.

The audio manager compounded the issue–its teardown method was designed to preserve in-flight sounds during navigation, which meant it actively started audio playback during Turbo page replacement.

This bug was already fixed by the time I diagnosed it! The data-turbo="false" change from the previous post eliminated Turbo page replacement for adjustment actions entirely. No page replacement means no Stimulus disconnect/reconnect churn, no window where an old controller receives broadcasts, and no stale audio state. The phantom sounds disappeared as a side effect of fixing the sync race condition.

That’s worth noting as its own lesson: when you fix a lifecycle bug, look for other symptoms that share the same root cause. I investigated the phantom sounds as a separate issue, with separate hypotheses about audio race conditions and localStorage corruption. They turned out to be downstream of the same Turbo page replacement cycle that caused the timer divergence. The fix was the same because the cause was the same.

WebSocket Reconnect Recovery

ActionCable subscriptions provide connected and disconnected callbacks on the client side. The disconnected callback fires when the WebSocket connection drops–network blips, server restarts, mobile devices switching between WiFi and cellular, laptop sleep/wake cycles.

My timer controller’s handleDisconnected() implementation was:

handleDisconnected() {
  // (empty)
}

During a disconnection, the server keeps broadcasting. With Solid Cable, those broadcasts are written to the database as usual. But without a connected WebSocket on the other end, they can’t be delivered. The messages sit in the cable table until they’re trimmed, and the disconnected client never sees them.

The recovery mechanism already existed on the server side. When the client reconnects, ActionCable re-subscribes to the channel, which triggers the server-side subscribed callback. Mine already called transmit_timer_state, which reads the current timer state and sends it as a point-to-point message to the reconnecting client.
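On the server, that looks roughly like the following; the channel class and the lookup are sketched, and transmit_timer_state is the method named above:

class TimerChannel < ApplicationCable::Channel  # class name assumed
  def subscribed
    stream_for timer       # re-subscribing resumes the broadcast stream
    transmit_timer_state   # point-to-point catch-up message for this client
  end

  private

  def timer
    @timer ||= Timer.find(params[:id])  # lookup details assumed
  end
end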

The server was offering the client a way to catch up. The client just wasn’t taking it.

The fix is small. On disconnect, flag that I may have missed broadcasts. On reconnect, clear the flag and force-accept the next server state–bypassing the staleness checks that would normally compare it against local state that may no longer be current:

handleDisconnected() {
  this.sync.missedBroadcasts = true
}

handleConnected() {
  if (this.sync.missedBroadcasts) {
    this.sync.missedBroadcasts = false
    this.sync.initialized = false  // force-accept next server update
  }
}

The re-subscription triggers transmit_timer_state on the server, which sends the current state as a point-to-point message. By resetting initialized, the client treats that message as an initial sync rather than an incremental update, accepting it unconditionally instead of comparing it against stale local state.

Brief WebSocket disconnections are more common in production than you might expect. WiFi access points hand off connections. Mobile networks switch cells. Laptops close and reopen. Server deploys cycle processes.

In development, the WebSocket connects to localhost and effectively never drops. In production, disconnections are routine–and if your disconnected callback is empty, every one of them is a silent opportunity for client state to drift.

The general pattern: if your ActionCable subscription manages client-side state, implement the disconnected callback. At minimum, flag that the client may have missed updates. If the server already sends state on subscription (a common pattern), you may not need additional logic–just ensure the client is ready to accept that state as a full resync rather than an incremental update.

This is the catch-all layer. The pessimistic locking ensures the server state is correct. The Solid Cable tuning ensures broadcasts are delivered promptly. The Turbo fix ensures the client processes them without interference. But even if all of those are working, a network blip can still cause a client to miss a broadcast. This layer ensures it recovers.

The Compound Fix

Here’s what I ended up with across both posts–five fixes at four layers of the stack:

Layer                 | Problem                                   | Fix
Turbo form lifecycle  | Navigator abort race on rapid clicks      | data-turbo="false" bypasses Turbo
Server concurrency    | Read-then-write race in mutation methods  | Pessimistic locking with with_lock + reload
Broadcast delivery    | 500ms polling widens timing windows       | Reduced to 100ms
WebSocket resilience  | No recovery from missed broadcasts        | handleDisconnected / handleConnected

The phantom sounds didn’t get a row of their own–they were downstream of the Turbo lifecycle issue and the spurious broadcasts, both already covered: the data-turbo change handled the former, and the fifth fix, deleting the host’s sync_with_client_state! before_action, handled the latter.

No single row in that table would have been sufficient on its own. The Turbo fix eliminated the most visible race but exposed the server-side concurrency bug. The pessimistic locking ensured correct server state but couldn’t help if a broadcast was missed due to a network blip. The Solid Cable tuning reduced delivery latency but didn’t prevent the race. The WebSocket reconnect recovery catches everything else–but only after the fact.

Each layer catches failures that the others miss. That’s defense in depth applied to application code: you don’t rely on one control, because no single layer can account for every failure mode. The server ensures correctness. The transport ensures timely delivery. The client ensures recovery.

Every one of these bugs was invisible in development. Localhost round-trips in single-digit milliseconds eliminate the timing windows where races happen. WebSocket connections to localhost don’t drop. Solid Cable polling latency doesn’t matter when the broadcast and the response arrive almost simultaneously.

Production–real devices, real networks, real latency–is where these layers get tested. If you can’t reproduce something locally but users are reporting it, the problem is probably in a timing window that your development environment is too fast to open.

If any of this sounds familiar, I’d love to hear about it–leave a comment!