Lattice OS.
An ops platform that consolidated four legacy tools and a ledger.
// A multi-tenant operations platform for fintech back-offices. I joined as employee №3 in 2024 to lead the rewrite of the core ledger and consolidate three internal tools that were costing a 14-person ops team most of their weeks.
What it is, in one paragraph.
Lattice is a single-pane-of-glass workspace for ops teams at lending and BNPL companies. It replaces a stack of Looker, Retool, three internal CRUD admins, two Notion pages, and an unhealthy amount of Slack. Underneath it sits a deterministic ledger that survives audits, refunds, disputes, and migrations without losing money or sleep.
Why the old stack stopped working.
The company had grown to 11 tenants on a stack written for two. Three concrete failure modes:
- Reconciliation took 14 hours of manual spreadsheet work, every night. The ledger stored balances as a column, maintained by triggers. Drift was a daily occurrence and nobody trusted the numbers without re-deriving them by hand.
- Refunds and disputes had five different code paths. Each path mutated rows in place. Auditors could not reconstruct the state of an account at a point in time, which was a SOC 2 finding.
- Ops engineers were writing Retool scripts as a way of life. A new tenant required ~40 hours of bespoke admin tooling. The team had become a custom-development shop for itself.
How I framed the rewrite.
Three rules, written down on day two, that the team referred back to constantly:
- The ledger is append-only. No updates, no triggers, no cached balances. Projections are derived.
- Internal tools are first-class product. Every admin surface is built like a customer-facing product: typed contracts, audit logs, permissions, undo.
- If we can’t hand it to the next team in two weeks, we built it wrong. Every primitive comes with a runbook and a 30-minute screen recording.
// The full RFC, with dissent, lives in the company Notion. I’ll publish a sanitized version eventually.
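To make the first rule concrete, here is a minimal in-memory sketch of an append-only ledger whose balances are derived rather than stored. It is an illustration under assumptions, not the production schema: the `LedgerEntry` shape, the `AppendOnlyLedger` class, and the use of integer minor units are all hypothetical names and choices introduced for this example.

```typescript
// Hypothetical entry shape; the real ledger_entries schema is not shown here.
type LedgerEntry = {
  readonly id: number;          // sequence number assigned on append
  readonly accountId: string;
  readonly amountMinor: number; // signed integer in minor units (e.g. cents)
  readonly at: number;          // epoch milliseconds
};

class AppendOnlyLedger {
  private readonly entries: LedgerEntry[] = [];

  // The only write path is append. There is deliberately no update or delete.
  append(accountId: string, amountMinor: number, at: number): LedgerEntry {
    if (!Number.isInteger(amountMinor)) {
      throw new Error("amounts must be integer minor units");
    }
    const entry: LedgerEntry = Object.freeze({
      id: this.entries.length + 1,
      accountId,
      amountMinor,
      at,
    });
    this.entries.push(entry);
    return entry;
  }

  // Balances are a projection: folded from entries on demand, never cached
  // next to them. Passing `asOf` reconstructs the state at a point in time.
  projectBalances(asOf: number = Number.POSITIVE_INFINITY): Map<string, number> {
    const balances = new Map<string, number>();
    for (const e of this.entries) {
      if (e.at > asOf) continue;
      balances.set(e.accountId, (balances.get(e.accountId) ?? 0) + e.amountMinor);
    }
    return balances;
  }
}

const ledger = new AppendOnlyLedger();
ledger.append("acct_1", 10_000, 1_000); // a charge
ledger.append("acct_1", -2_500, 2_000); // a refund is a new entry, not an edit
ledger.append("acct_2", 4_000, 3_000);

const current = ledger.projectBalances();     // acct_1: 7500, acct_2: 4000
const earlier = ledger.projectBalances(1_500); // acct_1: 10000, acct_2 absent
```

Because nothing is mutated in place, the same fold answers both "what is the balance now" and "what was it at time T", which is the property the old trigger-maintained balance column could not offer.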
The shape of the system.
A simplified version of the data path. Real diagram has six more boxes.
client (Next.js / tRPC)
       │
       ▼
┌──────────────┐    idempotency + auth
│   gateway    │◀── ReBAC tuples (SpiceDB)
└──────┬───────┘
       │
┌──────▼───────┐
│   command    │    writes to:
│   service    │──▶ ledger_entries (append-only)
└──────┬───────┘    ledger_events (CDC)
       │
┌──────▼─────────────┐
│   projector pool   │    reads CDC, writes:
│   (Temporal)       │──▶ balances_v
└────────────────────┘    invoices_v
                          reports_v

Three things worth flagging:
- Idempotency at the gateway. Every mutating request must carry a request-key. The gateway dedupes before the command service even sees it.
- Projections are recomputable. Every *_v table can be dropped and rebuilt from ledger_entries in < 9 minutes. We’ve done it in production. Twice.
- Workflows live in Temporal. Anything that touches two services or a third party (refunds, disputes, payouts) is a Temporal workflow with deterministic replays. No more “why did this happen at 03:14” mysteries.
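The gateway-level dedupe described in the first point can be sketched roughly as follows. This is a simplified, synchronous illustration, not the actual gateway: the `IdempotentGateway` class, the `CommandResult` shape, and the in-memory `Map` are assumptions made for the example (a real deployment would need a shared store with expiry and would have to handle concurrent in-flight duplicates).

```typescript
// Hypothetical result shape for the sketch.
type CommandResult = { status: number; body: string };

class IdempotentGateway {
  // Assumption: an in-memory map stands in for a shared, expiring key store.
  private readonly seen = new Map<string, CommandResult>();
  private readonly commandService: (payload: string) => CommandResult;

  constructor(commandService: (payload: string) => CommandResult) {
    this.commandService = commandService;
  }

  handle(requestKey: string | undefined, payload: string): CommandResult {
    // Every mutating request must carry a request-key.
    if (!requestKey) {
      return { status: 400, body: "missing request-key" };
    }
    // A repeated key replays the first outcome; the command service never sees it.
    const previous = this.seen.get(requestKey);
    if (previous) return previous;

    const result = this.commandService(payload);
    this.seen.set(requestKey, result);
    return result;
  }
}

let executions = 0;
const gateway = new IdempotentGateway((payload) => {
  executions += 1;
  return { status: 200, body: `applied: ${payload}` };
});

const first = gateway.handle("req-123", "refund 2500");
const retry = gateway.handle("req-123", "refund 2500"); // deduped
const noKey = gateway.handle(undefined, "refund 2500"); // rejected
```

After these three calls the command service has run exactly once, and the retry receives the same result object as the original request.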
The decisions I’d defend in a code review.
- Postgres, not a real ledger DB. We considered TigerBeetle. Tempting, but the team didn’t need 1M txns/sec and we did need JOINs. Won’t apologize.
- tRPC, not GraphQL. One consumer (our own Next.js app), shared types, and we never needed federation. GraphQL would have been a cost without an upside.
- ReBAC over RBAC. Took a week longer to model. Cost us an afternoon every quarter since.
- Temporal, not a homemade queue. The cost is a daemon and a UI. The benefit is a workflow we can debug at 2am without digging through a pile of stack traces.
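For readers unfamiliar with the ReBAC choice, the idea can be sketched with a handful of relationship tuples. This is a toy illustration only: it does not use SpiceDB's API or schema language, and the `Tuple` shape, the relation names, and the `canView` rule are hypothetical.

```typescript
// Hypothetical tuple shape, loosely in the Zanzibar style: object, relation, subject.
type Tuple = { object: string; relation: string; subject: string };

const tuples: Tuple[] = [
  { object: "tenant:acme", relation: "member", subject: "user:dana" },
  { object: "account:42", relation: "parent", subject: "tenant:acme" },
];

function hasTuple(object: string, relation: string, subject: string): boolean {
  return tuples.some(
    (t) => t.object === object && t.relation === relation && t.subject === subject,
  );
}

// Assumed rule: a user can view an account if they are a direct viewer,
// or a member of the tenant that is the account's parent.
function canView(user: string, account: string): boolean {
  if (hasTuple(account, "viewer", user)) return true;
  return tuples
    .filter((t) => t.object === account && t.relation === "parent")
    .some((t) => hasTuple(t.subject, "member", user));
}

const danaCanView = canView("user:dana", "account:42"); // true, via tenant membership
const eliCanView = canView("user:eli", "account:42");   // false, no relationship
```

In a model like this, onboarding a tenant means writing tuples rather than defining new roles, which is one plausible reading of why the extra modeling week paid off.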
What changed, measurably.
The hardest one to quantify, but the one I’m proudest of: the ops team stopped filing tickets to engineering. They build their own queries against the projection layer, in a small DSL we shipped. Engineering has been free to work on the platform for the last 8 months.
Three things that didn’t survive contact with reality.
- I shipped tRPC end-to-end before we had a public API need. When we eventually did, retrofitting REST on top was painful. Should have had both contracts from day one.
- SpiceDB locally is fine. Operating it is not. If I were starting today I’d probably start with a smaller homemade Zanzibar layer and migrate, not the other way around.
- I underinvested in the dev experience of writing new projections. Took eight months before we made it boring; should have done that in month one.