Architecture review is one of the highest-leverage activities in software engineering. A good review catches a wrong abstraction before it multiplies into fifty files. A missed review lets a bad data model propagate through the codebase until ripping it out costs more than the original build.
It's also one of the most expensive activities to do well. A thorough architecture review requires someone with enough context to understand the problem domain, the technical constraints, and the long-term implications of today's decisions. Senior engineers are expensive. Their attention is finite. You can't do a deep architecture review on everything.
This is where we've found AI genuinely useful — not as a replacement for architectural judgment, but as a way to extend it.
The Problem with Architecture Reviews
Most engineering teams skip architecture reviews for the same reason they skip most high-leverage activities: they're expensive upfront with payoffs that are hard to measure. The meeting to review a data model feels like overhead. The consequences of the wrong data model — six months of increasingly painful workarounds — don't trace back to the missed meeting.
The other problem is that architecture review requires adversarial thinking — deliberately looking for what's wrong, what's missing, and what breaks under load or under unusual conditions. This is psychologically hard. Engineers who've spent a week designing a solution are not well-positioned to adversarially attack their own work.
"The best architecture reviewer for your design is someone who didn't design it and has no attachment to it. That describes an LLM almost perfectly."
What AI Is Good At in Architecture Review
Surfacing known patterns and known antipatterns
LLMs have ingested enormous amounts of engineering writing — postmortems, architecture decision records, system design guides, incident reports. When you describe a system design to a capable model, it can draw on this corpus to identify patterns that match your description and flag risks that have been documented in similar systems.
In practice, this looks like: "Here's our proposed data model for a multi-tenant SaaS application. What are the most common mistakes teams make with this pattern, and do you see any of them here?" The model will produce a list. Most items will be things you've already considered. Some won't.
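That prompt is easy to turn into a reusable template so every design review asks the same question consistently. A minimal sketch — the template wording mirrors the example above, and the function name is ours, not part of any library:

```python
# Reusable template for the known-antipattern review prompt.
# The wording follows the example in the text; how you send it to a
# model (which client library, which model name) is up to you.

ANTIPATTERN_REVIEW = """\
Here's our proposed data model for a {system_kind}:

{design_doc}

What are the most common mistakes teams make with this pattern,
and do you see any of them here?"""

def build_antipattern_prompt(system_kind: str, design_doc: str) -> str:
    """Fill the template with a system description and design notes."""
    return ANTIPATTERN_REVIEW.format(
        system_kind=system_kind, design_doc=design_doc
    )

prompt = build_antipattern_prompt(
    "multi-tenant SaaS application",
    "tenant_id column on every table; RLS policies enforce isolation",
)
```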
Stress-testing assumptions
One of the most useful prompts in our review toolkit: "List the assumptions this design makes that would need to be true for it to work correctly. For each assumption, describe what happens if it's wrong."
This is a form of pre-mortem analysis — imagining failure before it happens. In under a minute, an LLM will produce a list that a human reviewer might need an hour of focused thinking to match. The list isn't exhaustive and it isn't always right, but it's a starting point that's faster to evaluate than it is to generate from scratch.
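To make the response easy to evaluate, we ask the model to number its assumptions, then split the reply into individual items for triage. A sketch — it assumes the model followed the numbering instruction, which real responses don't always do:

```python
import re

# The pre-mortem prompt from the text, plus an instruction that makes
# the response machine-splittable (the numbering request is our addition).
PREMORTEM_PROMPT = (
    "List the assumptions this design makes that would need to be true "
    "for it to work correctly. For each assumption, describe what happens "
    "if it's wrong. Number each assumption."
)

def parse_numbered_list(response: str) -> list[str]:
    """Split a response of the form '1. ...\n2. ...' into items.
    Assumes well-behaved numbering; production parsing should be
    more defensive."""
    items = re.split(r"(?m)^\s*\d+[.)]\s+", response)
    return [item.strip() for item in items if item.strip()]

sample = (
    "1. Tenant IDs are immutable. If wrong, cached lookups break.\n"
    "2. RLS is enabled on every table. If wrong, data leaks cross-tenant."
)
assumptions = parse_numbered_list(sample)
```

Each parsed item then becomes one row in the review session: keep, dismiss, or investigate.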
Asking "what did we not think about"
The most common cause of architecture regret isn't getting the wrong answer — it's not asking the right question. "We didn't think about how this would work with multi-region deployments." "We didn't consider what happens when the third-party API is unavailable." "We didn't think about the audit requirements."
Prompting a model with "What aspects of this system design have we not addressed that are commonly important for production systems in this domain?" produces a checklist that covers the obvious gaps. Combined with domain-specific prompting — "We're building for healthcare, what compliance considerations are we not addressing?" — it reliably surfaces things that weren't in the room.
A Real Example: Data Model Review
On a recent project — a multi-tenant B2B platform — we were designing the tenancy model. The initial design was straightforward: a tenant_id column on every table, with row-level security in PostgreSQL enforcing isolation.
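Sketched as the statements a migration might run, the pattern looks like this. The table name (`invoices`) and the session setting (`app.tenant_id`) are illustrative, not from the original design doc:

```python
# The tenant_id + RLS tenancy pattern, as PostgreSQL DDL strings.
# Names here are illustrative; the design doc didn't specify them.

RLS_SETUP = [
    # Every table carries the tenant key.
    "ALTER TABLE invoices ADD COLUMN tenant_id uuid NOT NULL",
    # Turn on row-level security; without FORCE, the table owner bypasses it.
    "ALTER TABLE invoices ENABLE ROW LEVEL SECURITY",
    "ALTER TABLE invoices FORCE ROW LEVEL SECURITY",
    # Each session sets app.tenant_id; the policy filters every query by it.
    """CREATE POLICY tenant_isolation ON invoices
       USING (tenant_id = current_setting('app.tenant_id')::uuid)""",
]
```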
Before finalizing, we ran the design through our AI review process. The prompt, condensed:
We're designing a multi-tenant SaaS application on PostgreSQL.
Tenancy is enforced via tenant_id columns + RLS policies.
Tenants have sub-accounts (organizations within a tenant).
What are the most common failure modes of this pattern?
What have we likely not considered?
The model's response surfaced several items. Most were things we'd considered. Three weren't:
- Cross-tenant reporting: our design had no clean mechanism for the platform operator to query across tenants without bypassing RLS — a requirement we'd overlooked because the product brief didn't mention it
- Tenant offboarding: no mention of what happens to data when a tenant churns — deletion strategy, export format, retention requirements
- RLS policy maintenance: as the schema evolves, ensuring every new table gets the correct RLS policy requires process discipline we hadn't designed for
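The third gap can be turned into a mechanical check rather than process discipline alone: PostgreSQL records whether RLS is enabled per table in `pg_class.relrowsecurity`, so a CI job can flag new tables that skipped it. A sketch — the exempt list and table names are assumptions, and in CI you'd feed the query's real rows in via psycopg or similar:

```python
# CI-style check: find tables in the public schema where RLS was never
# enabled. pg_class.relrowsecurity is false for such tables.

MISSING_RLS_QUERY = """
SELECT c.relname
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'public'
  AND c.relkind = 'r'
  AND NOT c.relrowsecurity
"""

def tables_missing_rls(rows: list[tuple], exempt: set[str]) -> list[str]:
    """Given rows from MISSING_RLS_QUERY, return offending table names.
    Some tables (e.g. shared reference data) legitimately skip tenancy;
    list those in `exempt`."""
    return sorted(name for (name,) in rows if name not in exempt)

# Pure-logic example with hypothetical rows:
offenders = tables_missing_rls(
    [("invoices",), ("feature_flags",)], exempt={"feature_flags"}
)
```

Failing the build when `offenders` is non-empty turns "process discipline" into something the schema can't silently drift past.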
The cross-tenant reporting issue would have cost us a significant rework six weeks later. We caught it in a ten-minute review session.
Where We Don't Trust AI
Two areas where AI assistance in architecture review is unreliable:
Context-specific judgment calls
Architecture decisions are almost always tradeoffs: consistency vs. availability, flexibility vs. simplicity, build vs. buy. An LLM will list tradeoffs well. It will not reliably tell you which side of the tradeoff is right for your specific context — your team's skills, your org's risk tolerance, your product's actual usage patterns. That judgment is the part that requires a human who knows the context.
Novel problem spaces
LLMs know what's been written about. For well-documented domains — web APIs, relational databases, microservices — the corpus is enormous and the model's pattern-matching is reliable. For genuinely novel problems — a new regulatory environment, an unusual hardware constraint, an industry that's not well-represented in technical writing — the model's confidence often exceeds its accuracy. In these cases, AI assistance in architecture review is best treated as a starting point for research, not a source of answers.
Our Process, Summarised
For any non-trivial architecture decision, we run a three-step review:
- Document the design — write it down well enough that someone who wasn't in the room can understand it. This discipline is valuable regardless of what comes next
- AI adversarial review — prompt for known failure modes, unaddressed assumptions, and domain-specific gaps. This takes 15–30 minutes and consistently surfaces at least one important consideration
- Human sign-off — a senior engineer reads the AI review output, evaluates each item against the project context, and makes the final call. The AI identifies the issues; the human decides what to do about them
This process doesn't replace architectural expertise. It makes architectural expertise go further — covering more ground in less time, with fewer things falling through the cracks.
That's what using AI as a precision instrument looks like in practice: not generating answers, but improving the quality of the questions.