
Lattice OS.
An ops platform that consolidated four legacy tools and a ledger.

// A multi-tenant operations platform for fintech back-offices. I joined as employee №3 in 2024 to lead the rewrite of the core ledger and consolidate three internal tools that were costing a 14-person ops team most of their weeks.

role: Lead Engineer
team: 5 engineers
timeline: 18 months
status: shipped · v2 rolling out

// 01 — overview

What it is, in one paragraph.

Lattice is a single-pane-of-glass workspace for ops teams at lending and BNPL companies. It replaces a stack of: Looker, Retool, three internal CRUD admins, two Notion pages, and an unhealthy amount of Slack. Underneath it, a deterministic ledger that survives audits, refunds, disputes, and migrations without losing money or sleep.

2.4M txns / day

tenants in prod

−68% ops headcount cost

99.97% uptime · 12 mo

// 02 — the problem

Why the old stack stopped working.

The company had grown to 11 tenants on a stack written for two. Three concrete failure modes:

Reconciliation took 14 hours of manual spreadsheet work, every night. The ledger stored balances as a column, kept by triggers. Drift was a daily occurrence and nobody trusted the numbers without re-deriving them by hand.
Refunds and disputes had five different code paths. Each path mutated rows in place. Auditors could not reconstruct the state of an account at a point in time, which was a SOC 2 finding.
Ops engineers were writing Retool scripts as a way of life. A new tenant required ~40 hours of bespoke admin tooling. The team had become a custom-development shop for itself.

// 03 — approach

How I framed the rewrite.

Three rules, written down on day two, that the team referred back to constantly:

The ledger is append-only. No updates, no triggers, no cached balances. Projections are derived.
Internal tools are a first-class product. Every admin surface is built like a customer-facing product: typed contracts, audit logs, permissions, undo.
If we can’t hand it to the next team in two weeks, we built it wrong. Every primitive comes with a runbook and a 30-minute screen recording.
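
// A minimal sketch of the first rule, not the production schema. The entry shape, the class name, and the method names are all assumptions made for illustration. It shows the two properties the rule asks for: the only write is an append, and a balance is derived by folding over entries, which also makes point-in-time reconstruction a filter rather than a forensic exercise.

```typescript
// Hypothetical entry shape; the real ledger_entries columns are not shown in this write-up.
type LedgerEntry = {
  readonly id: number;          // sequence number assigned on append
  readonly accountId: string;
  readonly amountMinor: number; // signed integer in minor units (cents): credits > 0, debits < 0
  readonly at: Date;            // when the entry took effect
};

class AppendOnlyLedger {
  private readonly entries: LedgerEntry[] = [];

  // The only write operation. A refund or correction is a new entry, never an update.
  append(accountId: string, amountMinor: number, at: Date): LedgerEntry {
    const entry: LedgerEntry = Object.freeze({
      id: this.entries.length + 1,
      accountId,
      amountMinor,
      at,
    });
    this.entries.push(entry);
    return entry;
  }

  // A projection: the balance is recomputed from entries, optionally as of a point in time.
  balanceOf(accountId: string, asOf?: Date): number {
    let total = 0;
    for (const e of this.entries) {
      if (e.accountId !== accountId) continue;
      if (asOf !== undefined && e.at.getTime() > asOf.getTime()) continue;
      total += e.amountMinor;
    }
    return total;
  }
}

const ledger = new AppendOnlyLedger();
ledger.append("acct_1", 10000, new Date("2024-01-01T00:00:00Z"));
ledger.append("acct_1", -2500, new Date("2024-01-02T00:00:00Z")); // refund as a new entry
const current = ledger.balanceOf("acct_1");
const beforeRefund = ledger.balanceOf("acct_1", new Date("2024-01-01T12:00:00Z"));
```

In a real system the fold would presumably run as a query or a materialized projection rather than an in-memory loop; the point of the sketch is only that nothing is ever mutated in place.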

// The full RFC, with dissent, lives in the company Notion. I’ll publish a sanitized version eventually.

// 04 — architecture

The shape of the system.

A simplified version of the data path. Real diagram has six more boxes.

     client (Next.js / tRPC)
            │
            ▼
     ┌──────────────┐   idempotency + auth
     │   gateway    │◀── ReBAC tuples (SpiceDB)
     └──────┬───────┘
            │
     ┌──────▼───────┐
     │  command     │   writes to:
     │  service     │──▶ ledger_entries (append-only)
     └──────┬───────┘    ledger_events (CDC)
            │
     ┌──────▼─────────────┐
     │   projector pool   │   reads CDC, writes:
     │   (Temporal)       │──▶ balances_v
     └────────────────────┘    invoices_v
                               reports_v

Three things worth flagging:

Idempotency at the gateway. Every mutating request must carry a request-key. The gateway dedupes before the command service even sees it.
Projections are recomputable. Every *_v table can be dropped and rebuilt from ledger_entries in < 9 minutes. We’ve done it in production. Twice.
Workflows live in Temporal. Anything that touches two services or a third party (refunds, disputes, payouts) is a Temporal workflow with deterministic replays. No more “why did this happen at 03:14” mysteries.
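
// The gateway dedupe described in the first point can be sketched as follows. This is an in-memory illustration under stated assumptions, not the gateway's actual code: the class name, the result shape, and the 400 response for a missing key are all hypothetical, and a production version would need a shared store with expiry rather than a Map.

```typescript
type CommandResult = { status: number; body: string };
type CommandHandler = (payload: string) => Promise<CommandResult>;

class IdempotencyGateway {
  // request-key -> in-flight or completed result (hypothetical in-memory store)
  private readonly seen = new Map<string, Promise<CommandResult>>();
  private readonly handler: CommandHandler;

  constructor(handler: CommandHandler) {
    this.handler = handler;
  }

  handle(requestKey: string | undefined, payload: string): Promise<CommandResult> {
    // Mutating requests without a key are rejected before reaching the command service.
    if (!requestKey) {
      return Promise.resolve({ status: 400, body: "missing request-key" });
    }

    // A repeated key returns the original result; the handler is not called again.
    const existing = this.seen.get(requestKey);
    if (existing) return existing;

    const result = this.handler(payload);
    this.seen.set(requestKey, result);
    // If the command fails, forget the key so the client can retry with it.
    result.catch(() => this.seen.delete(requestKey));
    return result;
  }
}
```

Storing the promise rather than the resolved value means two concurrent requests with the same key share one execution, which is the case a naive check-then-write dedupe tends to miss.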

// 05 — tradeoffs

The decisions I’d defend in a code review.

Postgres, not a real ledger DB. We considered TigerBeetle. Tempting, but the team didn’t need 1M txns/sec and we did need JOINs. Won’t apologize.
tRPC, not GraphQL. One consumer (our own Next.js app), shared types, and we never needed federation. GraphQL would have been a cost without an upside.
ReBAC over RBAC. Took a week longer to model. Cost us an afternoon every quarter since.
Temporal, not a homemade queue. The cost is a daemon and a UI. The benefit is a workflow we can debug at 2am without digging through a wall of stack traces.
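
// For readers unfamiliar with the ReBAC point above, here is a toy illustration of relationship tuples. It is not SpiceDB's API or schema language; the store, the "parent" relation, and the object names are hypothetical. It only shows how a permission can follow a relationship between objects instead of hanging off a global role.

```typescript
// Tuples read as "<subject> has <relation> on <object>", stored as "object#relation@subject".
class RelationshipStore {
  private readonly tuples = new Set<string>();
  private readonly parents = new Map<string, string[]>();

  write(object: string, relation: string, subject: string): void {
    this.tuples.add(`${object}#${relation}@${subject}`);
    // Assumption for this sketch: a "parent" tuple links an object to its container.
    if (relation === "parent") {
      const list = this.parents.get(object) ?? [];
      list.push(subject);
      this.parents.set(object, list);
    }
  }

  // True if the subject holds the relation directly, or on any ancestor via "parent".
  check(object: string, relation: string, subject: string, visited = new Set<string>()): boolean {
    if (visited.has(object)) return false; // guard against cycles
    visited.add(object);
    if (this.tuples.has(`${object}#${relation}@${subject}`)) return true;
    for (const parent of this.parents.get(object) ?? []) {
      if (this.check(parent, relation, subject, visited)) return true;
    }
    return false;
  }
}

const store = new RelationshipStore();
store.write("account:42", "parent", "tenant:acme");
store.write("tenant:acme", "operator", "user:ana");
const anaCanOperate = store.check("account:42", "operator", "user:ana"); // granted via the tenant
const bobCanOperate = store.check("account:42", "operator", "user:bob"); // no relationship
```

With plain RBAC, "operator" would be a role on the user; here it is scoped to one tenant and inherited by that tenant's accounts, which is the property that matters in a multi-tenant back-office.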

// 06 — impact

What changed, measurably.

nightly reconciliation: 14h → 4min

refund code paths: 5 → 1

SOC 2 review time: 3 wks → 3 days

onboard new tenant: 40h → 90min

The hardest one to quantify, but the one I’m proudest of: the ops team stopped filing tickets to engineering. They build their own queries against the projection layer, in a small DSL we shipped. Engineering has been free to work on the platform for the last 8 months.

// 07 — what i’d redo

Three things that didn’t survive contact with reality.

I shipped tRPC end-to-end before we had a public API need. When we eventually did, retrofitting REST on top was painful. Should have had both contracts from day one.
SpiceDB locally is fine. Operating it is not. If I were starting today I’d probably start with a smaller homemade Zanzibar layer and migrate, not the other way around.
I underinvested in the dev experience of writing new projections. Took eight months before we made it boring; should have done that in month one.

