Geddy Dukes
February 3, 2026 · 16 min read

Provenance: Treating LLMs Like Junior Analysts in Commercial Credit Underwriting

AI · Machine Learning · LLM · Financial Services · Neuro-Symbolic AI · Production AI · Architecture · Finance

An audit-first architecture for high-stakes AI systems

Opening

Commercial credit underwriting is slow for a reason. You are making high-stakes decisions with messy documents, inconsistent formats, and regulatory expectations that do not tolerate hand-waving. Getting it wrong has real consequences.

I built an AI-augmented underwriting platform that compresses a typical analysis cycle from weeks into days by separating probabilistic extraction from deterministic computation.

Every number and every claim carries end-to-end audit trails back to specific documents, pages, and table cells.

The guiding principle behind this system is simple: in industries that require precision and auditability, LLMs should be treated as inexperienced contributors. Their output must be traceable. Their actions must be explainable. They should be supervised the same way you would supervise a junior analyst on their first deal.

The goal is not fewer people. The goal is better leverage. Systems like this give every analyst the equivalent of a dedicated junior analyst, without removing human ownership from high-stakes decisions.


Background: Why This Problem Matters

Underwriting in a lending context is not just ratio calculation. You are reconciling statements across years, validating management narratives, resolving inconsistencies, and producing work that must withstand internal review, external audits, and future re-examination. For commercial real estate and other large-ticket lending, the capital structures are complex and the margin for error is thin.

What makes underwriting slow or risky

The most time-consuming steps are also the most failure-prone:

  • Reconciling financial statements across formats and reporting styles
  • Re-keying values and validating that numbers did not drift
  • Maintaining consistent metric definitions across deals and analysts
  • Performing due diligence, generating insights, and answering lingering questions not explicitly addressed in submitted documents
  • Reconstructing provenance months later when questions arise

Why existing tools fail regulated lenders

Most automation tools optimize speed, not defensibility. Regulated and policy-driven lenders require:

  • Deterministic and repeatable calculations
  • Transparent provenance through every transformation
  • Structured human review with explicit override paths
  • Strong defenses against malformed or adversarial inputs

The problem with most "AI for finance" tools is that they hand the LLM authority it has not earned. This system does the opposite.

In practice, this architecture does more than speed up underwriting. It removes the hidden tax of rework, reconciliation, and provenance reconstruction that dominates commercial credit workflows.

Across commercial real estate and C&I deals, the system reduces active analyst time by an estimated 40–60 percent, primarily by eliminating manual re-keying, spreadsheet-based metric calculation, and narrative drafting from scratch. More importantly, it collapses calendar time from the typical weeks-long cycle to days by running document ingestion, analysis, and consistency checks in parallel rather than serially.

Audit and review preparation, which often surfaces weeks or months later, shifts from ad hoc investigation to direct lookup. Questions that previously required reconstructing spreadsheets and source documents can be answered in minutes by following the provenance chain embedded in each metric and claim.


Architecture: Neuro-Symbolic by Design

High-level design

The platform is intentionally split into two layers:

  • Probabilistic layer: LLMs extract structure and draft narrative analysis.
  • Deterministic layer: normalization, metric computation, validation rules, and workflow state transitions.

The system does not allow probabilistic components to compute final credit metrics. This is not a limitation. It is the point.

The junior analyst analogy

Think of the LLM as a talented but inexperienced junior analyst on their first deal. They can read documents quickly, pull out relevant numbers, and draft a narrative. But you would never hand them the keys to the financial model and walk away.

Instead you would:

  • Have them extract and organize information
  • Check their work against known formulas and thresholds
  • Require them to cite sources for every claim
  • Flag inconsistencies for senior review
  • Block progression until issues are resolved

That is exactly how this system treats its LLM components. Every output is validated, every claim is traceable, and nothing reaches the final memo without deterministic verification.

Why this separation matters

If an LLM computes DSCR, you lose reproducibility, auditability, and policy control. In this system:

  • All financial metrics are computed from a versioned registry of formulas.
  • Metric evaluation uses a safe AST whitelist, not runtime evaluation.
  • Narrative claims must reference metric identifiers and are validated before acceptance.

This transforms the LLM from a black box calculator into a probabilistic co-contributor whose outputs are subject to deterministic validation and audit, just like a junior team member whose work is reviewed before it ships.
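The versioned registry described above can be sketched roughly like this. This is a minimal illustration under assumed names and structure (`METRIC_REGISTRY`, `metric_id`), not the actual schema: formulas are plain arithmetic strings stored as data, and each metric identifier embeds its version so a claim citing `credit.dscr.v1_2024` is unambiguous across re-runs.

```python
# Hypothetical sketch of a versioned metric registry. Formulas are data,
# never free-form code, and every entry is pinned to a version so the same
# inputs always produce the same, reproducible result.
METRIC_REGISTRY = {
    "credit.dscr.v1": {
        "formula": "ebitda / debt_service",  # arithmetic-only expression
        "inputs": ["ebitda", "debt_service"],
        "version": "v1",
    },
    "credit.concentration.v1": {
        "formula": "top3_customer_revenue / total_revenue",
        "inputs": ["top3_customer_revenue", "total_revenue"],
        "version": "v1",
    },
}

def metric_id(name: str, period: str) -> str:
    """Stable identifier a narrative claim must cite."""
    return f"{name}_{period}"

dscr_id = metric_id("credit.dscr.v1", "2024")  # "credit.dscr.v1_2024"
```

Because claims reference these identifiers rather than raw numbers, a reviewer can always resolve a claim back to a specific formula version and its inputs.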

Six-stage pipeline

```mermaid
flowchart LR
    subgraph INPUT
        D["📄 Documents<br/>PDF/Excel"]
    end
    subgraph STAGE1["Stage 1: Extraction"]
        E["🤖 LLM Agent<br/>document_extractor.md"]
    end
    subgraph STAGE2["Stage 2: Normalization"]
        N["⚙️ Deterministic<br/>Label Mapping"]
    end
    subgraph STAGE3["Stage 3: Metrics"]
        M["⚙️ Deterministic<br/>Formula Engine"]
    end
    subgraph STAGE4["Stage 4: Analysis"]
        A["🤖 LLM Agent<br/>financial_analysis.md"]
    end
    subgraph STAGE5["Stage 5: Review"]
        R["🤖 LLM + Rules<br/>reviewer.md"]
    end
    subgraph STAGE6["Stage 6: Memo"]
        U["🤖 LLM Agent<br/>underwriter.md"]
    end
    subgraph OUTPUT
        O["📋 Final Memo<br/>+ Findings"]
    end
    D --> E
    E -->|ExtractionBundle| N
    N -->|FinancialWorkbook| M
    M -->|computed_metrics| A
    A -->|claims + narrative| R
    R -->|findings| U
    U --> O
    style STAGE1 fill:#e1f5fe
    style STAGE4 fill:#e1f5fe
    style STAGE5 fill:#e1f5fe
    style STAGE6 fill:#e1f5fe
    style STAGE2 fill:#e8f5e9
    style STAGE3 fill:#e8f5e9
```

A Deal Walkthrough

The Scenario

  • Borrower: Redwood Precision Fabrication LLC
  • Industry: Light manufacturing
  • Request: $1.8M term loan, 7 years
  • Use of proceeds: Equipment purchase and working capital

Submitted documents:

  • Three years of financial statements (one scanned)
  • Interim YTD financials
  • Accounts receivable aging with customer list
  • Management projections and assumptions
  • Existing debt schedule

Qualitative inputs:

  • Loan application with borrower background and use of proceeds narrative
  • Project description and timeline
  • Management and leadership roster with bios
  • Borrower history and prior lending relationships

The system ingests all of these in parallel. Financial documents flow through the deterministic pipeline. Qualitative documents are processed by dedicated agents that extract claims, flag inconsistencies, and ground every assertion back to a source. The memo draft pulls from both streams.

What each agent found

This is where the junior analyst analogy becomes concrete. Each LLM agent has a narrow responsibility. None of them have authority. All of their outputs are checked.

Financial Risk

  • Gross margin declined from 28.3% (FY2023) to 24.1% (FY2024).
  • COGS variance increased 15% YoY, correlating with a supplier disruption note on page 8 of the FY2024 statements.
  • EBITDA remains positive but shows quarterly volatility, with one quarter falling below break-even.

Customer Concentration

  • Largest customer represents 34% of total revenue.
  • Top three customers represent 62% of total revenue.
  • The largest customer also represents the majority of current receivables in the AR aging.

Assumption Validation

  • Projections assume 18% revenue growth despite a historical CAGR of 6–8%.
  • Gross margin improvement is assumed without documented pricing power or cost reductions.
  • Payroll expense does not increase despite a stated hiring plan.

Projection Consistency

  • Projected DSCR excludes the full first year debt service of the proposed loan.
  • Capex increases are not matched by proportional depreciation.
  • Working capital assumptions conflict with historical AR days.

Policy and Covenants

  • Policy DSCR threshold: 1.25x.
  • Concentration exceeds internal guideline threshold.
  • Suggested mitigants include concentration covenants and reporting conditions.

Narrative and Memo

  • Strengths: operating history, equipment collateral, mission alignment.
  • Concerns: customer concentration, margin compression, optimistic projections.
  • Recommendation: conditional approval or reduced loan size.

Deterministic calculations

All metrics are computed deterministically from normalized financial statements and a versioned metric registry. The LLM never touches these calculations.

Passed:

  • DSCR (base case): 1.31x (threshold: 1.25x)
  • Revenue growth YoY: 4.2% (neutral range)

Failed:

  • Customer concentration: 62% (threshold: 50%)
  • Liquidity proxy: below minimum cash buffer
  • Projection consistency: narrative claim of 1.5x DSCR does not match computed 1.31x (blocking)
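A hedged sketch of how pass/fail results like those above could be produced. The names (`PolicyResult`, `check`) and thresholds are illustrative, not the production rule engine: each policy rule compares a deterministically computed metric against its threshold, with a direction flag indicating which side of the line passes.

```python
from dataclasses import dataclass

@dataclass
class PolicyResult:
    metric_id: str
    value: float
    threshold: float
    passed: bool

def check(metric_id: str, value: float, threshold: float,
          higher_is_better: bool = True) -> PolicyResult:
    # DSCR must clear its floor; concentration must stay under its ceiling.
    passed = value >= threshold if higher_is_better else value <= threshold
    return PolicyResult(metric_id, value, threshold, passed)

results = [
    check("credit.dscr.v1_2024", 1.31, 1.25),                  # passes
    check("credit.concentration.v1_2024", 0.62, 0.50, False),  # fails
]
```

Each result carries the metric identifier, so a failed check links directly back to the formula and source cells that produced the value.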

The junior analyst gets caught

Here is where the system does exactly what a good supervisor would do.

The LLM drafted a narrative claiming DSCR of 1.5x. The deterministic engine computed 1.31x. The review stage flagged this as a blocking finding. The workflow does not advance until the claim is corrected or explicitly overridden with documented evidence.

This is not an edge case. This is the system working as designed. The LLM made an error. The deterministic layer caught it. A human decides what happens next.

Overrides require a reason and evidence reference and are tracked with stable identifiers so they persist across re-runs. This preserves accountability without sacrificing speed.


Beyond the Numbers: Management, Narrative, and Cross-Document Analysis

The financial pipeline is the foundation. But a real credit memo is not just ratios. It is a synthesis of everything the borrower submitted, checked for internal consistency, and grounded in evidence. This is where the system starts to feel less like a calculator and more like a junior analyst who actually read the whole file.

Management and leadership verification

The system does not just accept the management roster at face value. For each person listed, it attempts to verify and enrich:

  • It searches the borrower's own website first, looking for matching profiles and bios.
  • If gaps remain, it runs a targeted search using the person's name and company to find their LinkedIn profile.
  • Every source is cited inline in the memo draft. If the system pulled a bio from LinkedIn, that link is embedded directly so the analyst can click through and verify.

This matters because management quality is one of the most important signals in commercial real estate lending, and it is also one of the easiest things to misrepresent. The system does not trust the roster. It checks it, and it shows its work.

Provenance applies here the same way it does for financial data. The source of every claim about a person is traceable: provided document, borrower website, or external search result. Nothing is assumed.

Cross-document consistency checks

One of the most time-consuming parts of underwriting is finding the things that do not match. The system does this automatically across every document type, not just financial statements.

For example: if the CEO is listed as one name on the management roster but a different name appears on the loan application, that is flagged as a question for the borrower. If the project description references a timeline that conflicts with the draw schedule in the loan request, that surfaces as an inconsistency. If the projections assume a use of proceeds that does not match what was stated in the application, the system catches it.

These are not financial calculations. They are the kind of cross-referencing that takes a junior analyst hours to do manually because it requires reading everything and holding it all in context at once. The system does it in parallel across the entire submission.

Every flag includes a citation back to the specific documents and sections where the inconsistency was found. The analyst does not have to hunt for it.

The memo draft

This is the real output. Not a list of metrics. Not a set of flags. A draft credit memo that an analyst can actually hand to a credit committee, with edits.

The memo is structured as a standard underwriting memorandum:

  • Executive Summary — recommendation, risk rating, and key decision factors
  • Borrower Overview — synthesized from application, background docs, and verified management information
  • Loan Structure — drawn from the loan request and matched against projections
  • Financial Analysis — grounded entirely in deterministic calculations, with every number traceable
  • Concerns — surfaced by the agents and validated against source documents
  • Trend Analysis — computed from historical financials
  • Benchmark Comparison — contextualizes the deal against relevant benchmarks
  • Projection Assessment — evaluates management projections against historical actuals and flags assumptions
  • Risk Assessment — synthesizes financial, operational, and structural risks
  • Conditions and Covenants — proposed based on policy thresholds and identified risks
  • Recommendation — the final synthesis, grounded in everything above

The LLM drafts this. The deterministic layer validates it. The analyst reviews, edits, and owns it. That is the workflow.

When the junior analyst does not have enough to work with

Not every deal comes in clean. Sometimes the submission is incomplete. Sometimes key documents are missing.

When that happens, the system does not guess or fill in the gaps with plausible-sounding text. It says so explicitly. Sections that cannot be completed are marked as such, with a clear explanation of what is missing and why. The recommendation becomes "more information needed" rather than a fabricated conclusion.

This is the junior analyst framing at its most literal. A new analyst who does not have the information to answer a question should say so, not make something up. The system does the same. The memo draft is still useful in this state because it tells you exactly what is missing and what can be completed once it arrives.


Technical Deep Dive

Challenge 1: Provenance Tracking

Problem

Values flow through extraction, normalization, metric computation, and analysis. Losing provenance at any step breaks trust. And in regulated lending, trust is not optional.

Solution

Provenance is a first-class object that survives every transformation:

  • Extraction attaches document ID, page, bounding box, and table cell references.
  • Normalization preserves provenance when mapping to canonical line items.
  • Metric computation records calculation steps and source traces for every input.
  • Analysis claims reference metric identifiers so they can be validated.
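A minimal sketch of what provenance as a first-class object might look like. The field names (`doc_id`, `page`, `cell`) and the `TracedValue` type are assumptions for illustration: the key property is that derived values merge the source chains of their inputs instead of replacing them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Provenance:
    doc_id: str
    page: int
    cell: str
    confidence: float

@dataclass(frozen=True)
class TracedValue:
    value: float
    sources: tuple[Provenance, ...]

    def derive(self, new_value: float, *others: "TracedValue") -> "TracedValue":
        """A derived value inherits every source that fed into it."""
        merged = self.sources + tuple(s for o in others for s in o.sources)
        return TracedValue(new_value, merged)

ebitda = TracedValue(1_670_000, (Provenance("doc_xyz", 3, "B8", 0.95),))
debt_service = TracedValue(927_778, (Provenance("doc_xyz", 3, "B9", 0.93),))
dscr = ebitda.derive(ebitda.value / debt_service.value, debt_service)
# dscr.sources now points at both B8 and B9
```

Freezing the dataclasses means provenance can never be mutated downstream, only extended through derivation.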

```mermaid
flowchart TB
    subgraph PDF["📄 Source Document"]
        P1["Page 3, Cell B8<br/>EBITDA: $1,670,000<br/>bbox: [120,240,280,255]"]
    end
    subgraph EXT["🔍 Extraction Layer"]
        E1["line_item: {label: 'EBITDA', value: 1670000,<br/>provenance: {doc_id: 'doc_xyz', page: 3,<br/>cell: 'B8', confidence: 0.95}}"]
    end
    subgraph WB["📊 Workbook Layer"]
        W1["income_statement.ebitda.values['2024'] =<br/>{value: 1670000, provenance: [→doc_xyz:3:B8]}"]
    end
    subgraph MET["📐 Metric Layer"]
        M1["credit.dscr.v1_2024 = {value: 1.80x,<br/>formula: ebitda/debt_service,<br/>calculation_steps: [<br/>{retrieve: ebitda, sources: [→doc_xyz:3:B8]},<br/>{retrieve: debt_service, sources: [→doc_xyz:3:B9]},<br/>{calculate: 1.80}]}"]
    end
    subgraph CLAIM["💬 Claim Layer"]
        C1["'Strong debt coverage with DSCR of 1.80x'<br/><br/>metric_id: credit.dscr.v1_2024<br/><br/>Fully traceable to source"]
    end
    PDF --> EXT
    EXT --> WB
    WB --> MET
    MET --> CLAIM
    style PDF fill:#fff3e0
    style EXT fill:#e1f5fe
    style WB fill:#e8f5e9
    style MET fill:#e8f5e9
    style CLAIM fill:#f3e5f5
```

Why it matters

This turns audit questions into direct lookups instead of investigations. When a reviewer asks where a value came from, you can point to a specific page and cell and show the entire chain of transformations that preserved it. The junior analyst does not need to remember. The system does.

Challenge 2: Agent Orchestration

Problem

Underwriting pipelines fail in production when they are either too rigid to handle messy documents or too flexible to remain auditable. You need throughput without losing determinism.

Solution

The platform runs as a staged pipeline with explicit boundaries:

  • Extraction runs in parallel per document for throughput.
  • Normalization and metric computation run in deterministic sequence.
  • Review gates progression based on findings severity and mode.
  • Each stage emits a versioned envelope with deterministic idempotency keys, enabling replay and caching.
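The deterministic idempotency keys might be built along these lines. This is a sketch under assumed naming (`idempotency_key`), not the actual implementation: hash the stage name, stage version, and a canonical serialization of the inputs, so an identical re-run maps to the same key and can be served from cache or replayed.

```python
import hashlib
import json

def idempotency_key(stage: str, stage_version: str, payload: dict) -> str:
    # sort_keys + fixed separators give a canonical serialization, so the
    # same logical inputs always hash to the same key regardless of dict order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(f"{stage}:{stage_version}:{canonical}".encode())
    return digest.hexdigest()[:16]

k1 = idempotency_key("normalize", "v2", {"doc_id": "doc_xyz", "page": 3})
k2 = idempotency_key("normalize", "v2", {"page": 3, "doc_id": "doc_xyz"})
# k1 == k2: key order in the payload does not change the key
```

Versioning the stage in the key means a logic change invalidates old cache entries automatically instead of silently reusing stale results.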

Handling conflicts

  • Deterministic computation is the source of truth for all metrics.
  • LLM generated claims are required to reference metric identifiers.
  • Mismatches become findings rather than silent inconsistencies.

Performance considerations

  • Concurrency is configurable for extraction.
  • Chunking escalates when extraction yield is too low.
  • Fail-fast thresholds stop wasting calls when outputs are repeatedly malformed.
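The fail-fast behavior can be sketched as a simple guard. The class name and limits here are assumptions, not the production code: once malformed outputs for a document cross a threshold, the stage stops retrying and surfaces a finding instead of burning further model calls.

```python
class FailFastGuard:
    """Trips after N consecutive malformed outputs; a valid output resets it."""

    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0

    def record(self, output_valid: bool) -> None:
        if output_valid:
            self.failures = 0  # streak resets on any valid output
        else:
            self.failures += 1

    @property
    def tripped(self) -> bool:
        return self.failures >= self.max_failures

guard = FailFastGuard(max_failures=2)
guard.record(False)
guard.record(False)
# guard.tripped is True: stop calling the model, emit a finding instead
```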

Challenge 3: Determinism and Safety

Problem

Allowing untrusted extracted content to influence formulas introduces both security risk and audit instability. You would not let a junior analyst write the financial model in production without review. The same principle applies here.

Solution

  • Metric formulas are evaluated via AST parsing with a strict operator whitelist.
  • Only arithmetic operations are allowed. No conditionals, no code execution.
  • Metric registries are versioned so results are reproducible across time.
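Whitelist-based formula evaluation can be sketched as follows. This is a minimal illustration, not the production evaluator: the formula is parsed into an AST, and only arithmetic operators, named inputs, and numeric constants are permitted, so extracted content can never smuggle in conditionals, attribute access, or function calls.

```python
import ast
import operator

# Allowed AST node operators: arithmetic only, per the whitelist principle.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def safe_eval(formula: str, inputs: dict[str, float]) -> float:
    def walk(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        if isinstance(node, ast.Name):
            return inputs[node.id]  # only named inputs, no globals
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return float(node.value)
        raise ValueError(f"disallowed node: {type(node).__name__}")
    return walk(ast.parse(formula, mode="eval"))

dscr = safe_eval("ebitda / debt_service",
                 {"ebitda": 1_670_000, "debt_service": 927_778})  # ≈ 1.80
# safe_eval("__import__('os')", {}) raises ValueError before anything runs
```

Anything outside the whitelist, including a function call node, raises before evaluation, which is exactly the failure mode you want for untrusted input.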

Tradeoffs

The formula language is intentionally limited. More complex metrics require decomposition into simpler metrics or pre-computed inputs. This is a deliberate trade: flexibility is sacrificed to guarantee auditability.


Where It Breaks

Failure case

Low quality scanned PDFs introduce OCR errors: missing negatives, shifted columns, and ambiguous period labels.

Why it fails

The system is designed not to guess. Determinism means incorrect inputs remain visible rather than being silently "fixed" by a probabilistic component. A junior analyst who does not know the answer should flag it, not fabricate one. This system does the same.

Human role

You either supply corrected documents, manually verify values, or override findings with documented evidence when policy allows it.


What This Means Beyond Underwriting

The junior analyst framing is not specific to lending. It applies anywhere LLMs are being integrated into high-stakes workflows:

  • Legal: Contract review where a missed clause has real consequences.
  • Medical: Clinical decision support where a hallucinated reference is dangerous.
  • Engineering: Code review where an unchecked suggestion ships a vulnerability.
  • Finance: Any system where a wrong number reaches a decision maker.

The pattern is the same in every case. LLMs are fast, capable, and useful. They are also wrong in ways that are hard to predict and easy to miss. The answer is not to trust them less. It is to build systems that do not require trusting them.

Treat them like junior team members. Validate their work. Require citations. Flag inconsistencies. Let humans make the final call.

That is what Provenance does for underwriting. The architecture is specific. The principle is not.


What's Next

Gaps for production

  • Pre-extraction document quality scoring and routing
  • Domain-specific normalization packs for nonprofit and insurance statements
  • Analyst-facing UI optimized for provenance navigation and override workflows

Technical debt

  • Provenance storage overhead and query complexity
  • More expressive but still deterministic metric logic
  • Better handling of multi-period and mixed format submissions

Future improvements

  • Automated need lists driven by failed dependencies and review findings
  • Scenario analysis using the same deterministic metric registry
  • Portfolio-level benchmarking across peer deals using consistent definitions


Provenance is currently a private project. If you are building AI systems for regulated environments and want to talk architecture, reach out.
