blog.krauza.com
~ / posts / pocket_id_ha_part_2 / index.md

Pocket ID HA, Part 2: Write Proxies, 401s, and the Edge Cases You Don't Think About Until Bed

Building the write proxy for Pocket ID’s HA secondary node exposed three cascading failures — a dropped cookie, mismatched JWT signing keys, and a WebAuthn registration edge case — each one invisible until the previous was fixed.

Part 1 of this series covered the sync architecture I designed for running Pocket ID across two sites: a primary/secondary polling model, ECIES encryption over X25519, and a reconciliation loop that keeps users, passkeys, OIDC clients, and app config consistent without re-enrolling a single credential. That post made the whole thing sound fairly contained. This post describes what I didn’t think about.

I went to bed after finishing Part 1, and somewhere around midnight, a thought occurred to me: what actually happens when a user lands on the secondary node and tries to do a write? Not a login — a write. Changing a display name, adding a passkey, updating an OIDC client. The sync flows primary to secondary. There’s no path for secondary writes to reach primary. Which means the answer, in the state the code was in after Part 1, was: the write succeeds locally, gets silently overwritten on the next sync pull, and the user sees their change disappear.

That’s the kind of edge case that AI-assisted development doesn’t catch, not because the AI doesn’t know about write proxying, but because nobody asked the right question at the right moment. I was focused on the sync architecture. The write problem lived one layer above it, in a scenario I hadn’t loaded into the conversation. That’s the work of a human who actually knows the system — noticing the gap after signing off and then doing something about it.

The First Idea: Tell the User They’re on a Secondary

Before writing any proxy code, my first instinct was to handle this at the UI layer. The plan was to expose syncRole in the public app config API response, surface it in the frontend’s AppConfig type, and show a full-width banner whenever syncRole === 'secondary': You are connected to a read-only replica. Please use the primary node to make changes."` Additionally, the admin panel on secondaries would be stripped down to just the Sync Peers tab.

I built most of it before scrapping it.

The problem wasn’t implementation complexity — it was that the approach was wrong. A permanent degraded-experience banner and a watered down admin panel means every user who happens to land on the secondary would have to find their way to the primary instance somehow. It also requires the load balancer to have write-aware routing logic, so that anything needing writes gets steered to primary. Also, it doesn’t solve the API layer at all: anything that bypasses the UI, any OIDC client or API consumer, would still silently lose writes.

The better answer was to make the secondary transparent. Users shouldn’t have to know or care which node they’re on. That framing made the right approach obvious: the secondary itself should intercept write requests and forward them to primary. From the browser’s perspective, the request goes out and a response comes back. The proxy is invisible.

Building the Write Proxy

The implementation lives in backend/internal/middleware/secondary_proxy.go. It uses Go’s httputil.ReverseProxy, registered as the first Gin middleware on secondary nodes, so it runs before CSP and CORS headers are added. That ordering matters: proxied responses from primary already carry those headers, and you don’t want the secondary adding a second set.

The routing logic is simple: GET, HEAD, and OPTIONS are served locally. Everything else (POST, PUT, PATCH, DELETE) is forwarded to primary — except for a set of paths that can’t safely be proxied.

The localOnlyPaths Problem

Several flows in Pocket ID pair a GET request that creates server-side state with a POST that must find that state on the same node. Proxying those writes to primary would break the pairing, because the state was created on the secondary:

  • WebAuthn: register/start creates an in-memory challenge; register/finish must validate it on the same node.
  • OIDC: /oidc/authorize creates an auth code in the local database; /oidc/token must exchange it on the same node.
  • OIDC device flow: device/authorize and device/verify are similarly pinned to each other.
  • SAML: uses an in-memory pending session store.
  • Sync routes: internal replication endpoints, never for client use.

Getting this list right required auditing the actual controller logic rather than guessing from path names. The first pass included paths like /api/one-time-access-token/ and /api/signup, which looked stateful but are actually pure writes with no paired GET. Those should go to primary, not stay local. After trimming:

go
 1var localOnlyPaths = []string{
 2    "/api/webauthn/register/",
 3    "/api/webauthn/login/",
 4    "/api/oidc/token",
 5    "/api/oidc/authorize",
 6    "/api/oidc/end-session",
 7    "/api/oidc/introspect",
 8    "/api/oidc/userinfo",
 9    "/api/oidc/device/",
10    "/saml/",
11    "/sync/",
12}

One smaller thing worth noting: the implementation switched from the deprecated Director field on httputil.ReverseProxy to the newer Rewrite function (Go 1.20+) to avoid compiler warnings. Minor, but worth keeping the dependency graph clean.

With the proxy registered and write requests routing to primary, the next question was whether it was actually working. The answer was no — primary was returning 401 on every proxied request.

The Cookie That Wasn’t Making the Trip

The first thing to verify was whether the auth cookie was actually being forwarded, and for that I turned to my true and tired friend Wireshark. Being that the traffic between the two nodes was over HTTP (hence the previous design decision to use ECIES for the sync), I was able to sniff the traffic. Wireshark on the network between the two instances confirmed: the cookie was arriving at the secondary correctly, but the secondary was not including it in the forwarded request to primary. The Cookie header was being dropped somewhere in the middleware chain before the Rewrite function ran.

Wireshark packet capture of the traffic between Primary and Secondary Nodes
Wireshark packet capture of the traffic between Primary and Secondary Nodes

httputil.ReverseProxy copies most request headers automatically, but not this one. The fix was explicit:

go
1Rewrite: func(r *httputil.ProxyRequest) {
2    r.SetURL(target)
3    r.Out.Host = target.Host
4    r.SetXForwarded()
5    if cookie := r.In.Header.Get("Cookie"); cookie != "" {
6        r.Out.Header.Set("Cookie", cookie)
7    }
8},

We also added structured logging of the cookie names being forwarded (not values, obviously) so I could confirm the right cookies were making it through without writing secrets to logs:

text
1INF Forwarding Request to Primary Host primary=... path=... cookies=[__Host-access_token locale]

With the cookie confirmed arriving at primary, the 401 should have gone away. It didn’t.

The 401 That Wasn’t About the Cookie

This is where it got interesting.

PocketID’s auth middleware does two things on every authenticated request: it reads the __Host-access_token cookie and calls jwtService.VerifyAccessToken(), then it looks up the user by ID in the database. The JWT verification uses the node’s private RSA key to validate the signature. That’s the piece I hadn’t thought about.

Each Pocket ID instance generates an RSA-2048 signing key on first startup and stores it in the database, encrypted with a Key Encryption Key (KEK). The KEK is derived as:

text
1KEK = HMAC-SHA3-256(ENCRYPTION_KEY, "pocketid/" + instanceID + "/jwk-kek")

The instanceID is a UUID generated per instance and stored in the app config table. Critically, it is tagged internal — deliberately excluded from the sync payload. Each node is supposed to have its own identity. That’s a correct design decision for a single-node deployment.

The consequence for HA was that the secondary generates its own RSA key pair at startup. When a user logs into the secondary, the resulting JWT is signed with the secondary’s private key. When that JWT is sent to primary, the primary tries to verify it with its own private key — a completely different key pair. The signature check fails. That’s the 401.

I had a feeling thinking about this problem last night, and turns out, I was right. That is why I previously discounted this idea earlier in my design (because I figured there was some key nonsense), but I had to bite the bullet and figure out the token problem anyway.

First Attempt: Sync the kv Table

The JWT private key is stored in a kv table row under jwt_private_key.json. The natural first instinct was to add the kv table to the sync payload so the secondary would receive and store primary’s encrypted key blob.

This failed because the KEK is derived from instanceID, and each node has a different instanceID. The secondary computed a different KEK and couldn’t decrypt primary’s blob:

text
1WRN Sync: failed to reload JWT key after applying payload
2    error="failed to load key: failed to decrypt private key: failed to decrypt data"

It got worse from there. The job had actually written primary’s encrypted blob into secondary’s kv table during the sync. On the next restart, the secondary tried to load that blob using its own KEK, and crashed on startup:

text
1ERR Failed to run pocket-id
2    error="failed to initialize services: failed to create JWT service:
3           failed to load key: failed to decrypt private key: failed to decrypt data"

The recovery was to wipe the secondary’s database and let sync rebuild it fresh from primary. Not ideal, but it was the right call — the database was in a bad state and there was no cleaner path out. (Also, this is the whole point of the replication, loosing that data was totally acceptable.) The code fix was to abandon the kv approach entirely.

The Correct Fix: Export Raw JWK in the Sync Payload

The KEK problem exists because it’s tied to instanceID, and instanceID is intentionally different on each node. The way around it is to bypass the KEK entirely.

Primary’s JWT service already has the private key decrypted in memory. The method exports it as raw JWK JSON bytes and embed it directly in the sync payload:

go
1// On primary — ExportEncrypted()
2jwkBytes, _ := s.keyExporter.ExportPrivateKeyJWK()
3payload.JwtPrivateKeyJWK = jwkBytes

The entire sync payload is already wrapped in ECIES encryption, as described in Part 1 — only the registered secondary can decrypt it. So the raw key material is protected in transit without needing an additional encryption layer. It’s not bypassing security; it’s relying on the security that’s already there.

On the secondary, after applying the sync payload and reloading app config, the ImportPrivateKeyJWK method is called to hot-swap the in-memory signing key without touching the database or involving the KEK at all:

go
1// On secondary — PollPrimary()
2if len(payload.JwtPrivateKeyJWK) > 0 {
3    jr.ImportPrivateKeyJWK(payload.JwtPrivateKeyJWK)
4}

From that point on, the secondary verifies tokens using the same private key as primary. Proxied authenticated writes started succeeding. The 401 went away.

The JWT key issue was the most interesting failure in this whole session, and also the easiest one to miss. Each node’s signing key is intentionally isolated — the KEK includes instanceID for exactly that reason, and that’s the right design for standalone deployments. But an HA proxy layer introduces a new requirement: tokens issued on one node need to be verifiable on another. That’s not a problem the original design needed to solve, so it didn’t. Recognizing that the requirement had changed, and that the fix needed to work within the existing encryption architecture rather than around it, is the kind of thing that takes a moment to see clearly.

The WebAuthn Registration Edge Case

With authenticated writes flowing correctly, the last problem surfaced during passkey registration.

The registration flow is:

  1. GET /api/webauthn/register/start — the secondary generates a WebAuthn challenge and stores it in memory
  2. POST /api/webauthn/register/finish — the secondary validates the challenge and writes the new credential to its local database
  3. PATCH /api/webauthn/credentials/:id — the browser renames the new credential

Step 3 was being proxied to primary, which returned 404. The credential had been written to the secondary’s database in step 2, not primary’s. Sync flows primary to secondary, not the other direction, so there’s no path for that credential to reach primary before step 3 fires.

We tried narrowing localOnlyPaths to just register/ and login/ sub-paths, letting credential management (PATCH, DELETE) proxy to primary. That made the 404 more explicit, but it didn’t fix anything — the timing problem remained. The credential created by register/finish on the secondary doesn’t exist on primary until the next sync cycle, but the PATCH fires immediately after registration completes.

The root issue is architectural. The WebAuthn challenge is in memory on the secondary. The entire registration flow is pinned to that node because of it. There’s no way to proxy register/finish to primary without also moving the challenge there, and that would require restructuring how the challenge is stored.

The correct fix is at the load balancer: route all /api/webauthn/register/ traffic to primary, always. When that’s in place, the challenge is created in primary’s memory, register/finish writes the credential to primary’s database, and sync replicates it down to secondaries on the next cycle. Login flows can hit any node since credentials are already synced everywhere.

This is the one case where the write proxy can’t fully hide the primary/secondary distinction from the client. The LB needs to know about it. That’s an acceptable tradeoff — it’s a single routing rule for a specific path, and it keeps the registration flow clean and reliable. In the event that the primary is unreachable, users can’t register new credentials, but in that scenario, the authentication service being available is the more important part, not the ability to perform updates or registrations.

What did I learn?

None of these problems are out of the norm. They’re the ordinary consequences of building an HA layer on top of a system that was designed for a single node. The JWT signing key issue is probably the most instructive: the KEK derivation including instanceID is exactly the right design for standalone use, and it’s exactly the thing that needs special handling the moment you introduce a proxy. The design didn’t fail — it just encountered a requirement it wasn’t built for.

Claude helped significantly with the proxy implementation and the kv table approach (including flagging, fairly quickly, why that approach would fail). But the late night realization that secondary writes were a problem at all — that came from thinking about the system as a whole, in the way that tends to happen when you’re no longer actively coding and your brain is doing its background processing. That’s not a knock on AI assistance. It’s just an honest account of where the boundary sits. The tools are good at solving the problem you bring to them. They’re not going to lie awake thinking about the problem you forgot to bring.

Search