Eval harnesses and the governance posture your auditor will accept

Auditors read documentation before they read code

Every AI audit we have shadowed — across SOC 2 Type II, HITRUST, OCC bank examinations, FDA pre-submission meetings, and DoD risk assessment frameworks — opens the same way. The auditor asks for the governance documents. They read them. They form a hypothesis about how seriously the team takes AI risk. Only then do they look at code, configurations, and live systems. The documentation sets the tone for the entire engagement.

Teams that have current, version-controlled, decision-log-supported governance artifacts move through audits in days. Teams without them spend weeks assembling evidence retroactively while the auditor's confidence drops. The governance posture is observable in five minutes; the technical posture takes hours to assess. The first impression carries a lot of weight.

Eval harness documentation includes more than the harness itself

An auditor reading eval-harness documentation expects to see: the eval suites and their purpose, the eval set composition and curation process, the thresholds and gating logic, the calibration discipline for any LLM-as-judge components, the run cadence and CI integration, the historical run results, the response process for failures, and the policy version history. Each is a separate artifact. The harness code itself is the smallest piece.

We deliver eval harness documentation as a structured set of files versioned alongside the code. The auditor walks through each artifact in order, and the harness becomes legible to a non-engineer in minutes, which is exactly what risk and compliance reviewers need in order to do their work.
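The thresholds and gating logic called out above can be sketched as a small release gate. This is a minimal illustration, not our harness: the suite names and threshold values below are hypothetical.

```python
# Hypothetical release gate: compare eval-suite scores against documented
# thresholds and block the release if any gated suite falls short.
GATES = {  # suite name -> minimum passing score (illustrative values)
    "factual_accuracy": 0.90,
    "safety_refusals": 0.99,
    "tenant_isolation": 1.00,
}

def release_gate(scores: dict) -> tuple:
    """Return (passed, failures). A missing suite counts as a failure."""
    failures = []
    for suite, threshold in GATES.items():
        score = scores.get(suite)
        if score is None:
            failures.append(f"{suite}: no result recorded")
        elif score < threshold:
            failures.append(f"{suite}: {score:.2f} < {threshold:.2f}")
    return (not failures, failures)

passed, failures = release_gate(
    {"factual_accuracy": 0.93, "safety_refusals": 0.97, "tenant_isolation": 1.0}
)
# safety_refusals is below its 0.99 threshold, so the gate fails
```

The gate result, the scores, and the threshold table are all artifacts the auditor can read; the function itself is the smallest piece.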

Audit artifacts: 8–12 documented, version-controlled
SOC 2 evidence requests: ~85% satisfied from existing artifacts
Audit cycle time with current governance: < 2 weeks
Audit cycle time without governance: 6–10 weeks of evidence assembly under pressure

Model cards are the per-system artifact regulators expect

A model card documents a specific production AI system: its purpose, the model and version, the training and evaluation data composition, the eval results across suites, known limitations, intended-use restrictions, the responsible owner, and the change history. The format originated in academic AI ethics work and has been adopted by FDA SaMD guidance, NIST AI Risk Management Framework, and OCC SR 11-7-style model risk frameworks.

Every production AI system should have a model card. The card is updated on release, version-controlled, and presented to the governance forum during release approval. Auditors who ask 'what is this system' get the model card; auditors who ask 'how do you know it works' get the eval section; auditors who ask 'who is responsible' get the owner section. The model card is the answer to most first-round audit questions.
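As an illustration, a model card can be kept as structured data versioned alongside the code. The fields below mirror the list above; the field names and example values are our own, not a mandated schema.

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class ModelCard:
    # Field names are illustrative; regulators mandate content, not a schema.
    system_name: str
    purpose: str
    model_version: str
    training_data: str          # data composition summary
    eval_results: dict          # suite name -> latest score
    known_limitations: list
    intended_use_restrictions: list
    responsible_owner: str
    change_history: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize for version control and governance-forum review."""
        return json.dumps(asdict(self), indent=2)

# Hypothetical example system
card = ModelCard(
    system_name="claims-triage-assistant",
    purpose="Route incoming claims to the correct adjudication queue",
    model_version="model-v3.2 (pinned)",
    training_data="No fine-tuning; retrieval corpus: claims policy docs v12",
    eval_results={"routing_accuracy": 0.94, "pii_leakage": 0.0},
    known_limitations=["Degrades on handwritten claim scans"],
    intended_use_restrictions=["Human review required before any denial"],
    responsible_owner="claims-platform-team",
)
```

Because the card is plain data, 'what is this system', 'how do you know it works', and 'who is responsible' each map to a field the auditor can point at.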

The decision log carries institutional memory across personnel and time

Every governance forum decision lands in a decision log: the date, the decision, the rationale, the dissent if any, and the owner of follow-on actions. The log is append-only. When the team turns over, the next governance chair reads the log and understands why the current architecture is what it is, why the eval policy includes the categories it does, why a specific model was selected over alternatives. Without the log, this knowledge leaks every personnel transition.
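A minimal way to honor the append-only property is to write each decision as one JSON line and never rewrite the file. The helper below is a sketch under that assumption; real enforcement would come from repository permissions or WORM storage, not the file mode.

```python
import json
from datetime import date
from pathlib import Path

def record_decision(log_path: Path, decision: str, rationale: str,
                    owner: str, dissent: str = "") -> None:
    """Append one governance-forum decision; existing entries are never edited."""
    entry = {
        "date": date.today().isoformat(),
        "decision": decision,
        "rationale": rationale,
        "dissent": dissent,
        "owner": owner,   # owner of follow-on actions
    }
    with log_path.open("a") as f:   # append mode: the log only grows
        f.write(json.dumps(entry) + "\n")

def read_log(log_path: Path) -> list:
    """Replay the full decision history, oldest first."""
    return [json.loads(line) for line in log_path.read_text().splitlines()]
```

The next governance chair runs `read_log` and gets the architecture's history in order, rationale and dissent included.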

Auditors love decision logs. The presence of a current decision log demonstrates governance discipline. The contents demonstrate that decisions were made deliberately, with documented rationale, and that the team can defend the architecture against scrutiny. We have seen audits resolve favorably solely because the decision log was current and substantive.

Continuous monitoring evidence is the live system's audit footprint

Eval harnesses run on every release; monitoring runs continuously. The monitoring stack records eval-suite scores from production-comparison runs, drift signals, anomaly detection on quality and cost metrics, incident events, and the response to each incident. Auditors ask for monitoring evidence — typically the last 90 days — and the dashboards have to render that evidence cleanly.

We architect monitoring with the audit use case in mind. Dashboards are exportable. Time ranges are configurable. Per-system, per-tenant, per-category breakdowns are available. The auditor doesn't get a screenshot; they get the queryable system. This level of accessibility differentiates a deployment ready for audit from one that is technically capable but operationally opaque.
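The queryable-evidence idea above can be sketched as a filter plus a breakdown over monitoring records. The record shape and field names here are hypothetical, standing in for whatever the monitoring stack actually stores.

```python
from datetime import datetime, timedelta

def audit_window(records: list, days: int = 90, now: datetime = None) -> list:
    """Return monitoring records inside the audit window (default 90 days)."""
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=days)
    return [r for r in records if r["timestamp"] >= cutoff]

def breakdown(records: list, key: str) -> dict:
    """Per-system / per-tenant / per-category event counts for a dashboard."""
    counts = {}
    for r in records:
        counts[r[key]] = counts.get(r[key], 0) + 1
    return counts
```

Handing the auditor these two operations (a configurable time range and an arbitrary grouping key) rather than a screenshot is the difference the section above describes.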

Regulator-specific overlays are real and need to be designed in

FDA SaMD requires specific clinical-validation documentation, a software lifecycle process per IEC 62304, and design controls per 21 CFR 820.30, with risk categorization following the IMDRF SaMD framework. Bank-supervisory guidance (SR 11-7, issued by the OCC as Bulletin 2011-12) requires model risk management that treats AI systems as models. DoD assurance requires CAS-aligned controls and explainability documentation per the relevant directive. SOC 2 requires evidence of control operation per the trust services criteria.

Each overlay adds artifacts to the base governance posture but doesn't change its shape. We design the base posture to absorb regulatory overlays gracefully — the model card structure accepts FDA-specific fields, the eval policy accepts SR 11-7-style validation requirements, the decision log accepts the additional audit fields. Designing for one regulator and retrofitting for the next is more expensive than designing for the whole envelope from the start.
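One way to let the base artifacts absorb overlays without changing shape, as described above, is to keep each regulator's fields in a namespaced section of the model card. The overlay fields and reference IDs below are illustrative.

```python
BASE_CARD = {
    "system_name": "claims-triage-assistant",
    "purpose": "Route incoming claims to the correct adjudication queue",
    "eval_results": {"routing_accuracy": 0.94},
}

# Regulator overlays add fields without touching the base keys (illustrative).
OVERLAYS = {
    "fda_samd": {"clinical_validation_ref": "CV-2024-031",
                 "iec_62304_class": "B"},
    "sr_11_7": {"model_tier": 2, "independent_validation_ref": "VAL-104"},
}

def apply_overlays(card: dict, regimes: list) -> dict:
    """Attach each regulatory overlay under its own namespaced key."""
    out = dict(card)
    out["overlays"] = {r: OVERLAYS[r] for r in regimes}
    return out
```

Adding a new regulator means adding one entry to the overlay table; the base card, eval policy, and decision log keep their shape.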

Independence between development and validation is the structural requirement most teams miss

SR 11-7, NIST AI RMF, and most regulator-style governance demand structural independence between the team that builds the model and the team that validates it. The same engineer cannot 'mark their own homework.' The eval owner reports to a different leadership chain than the model owner. The validation work is documented separately. This is one of the most common audit findings when the structure is missing — and one of the easier fixes if designed in early.

We design enterprise AI capabilities with this independence baked in, even at smaller organizations where the same person may functionally do both. The artifact discipline preserves the structural independence: the eval owner role wears a different hat than the model owner role, decisions are documented from each role separately, and the governance forum reviews both. The structure survives even when individuals are wearing multiple hats.
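The artifact-level independence described above can even be checked mechanically. This is a hypothetical sketch: the artifact fields are our own naming, standing in for whatever the governance records actually contain.

```python
def check_independence(artifacts: list) -> list:
    """Flag systems where the eval and model roles are not structurally
    separated (the 'mark their own homework' audit finding)."""
    findings = []
    for a in artifacts:
        if a["eval_owner_chain"] == a["model_owner_chain"]:
            findings.append(
                f"{a['system']}: eval and model owners share a reporting chain"
            )
        if a["eval_owner"] == a["model_owner"] and not a.get("dual_hat_documented"):
            findings.append(
                f"{a['system']}: same person in both roles without documented role separation"
            )
    return findings
```

At a small organization the second condition is the interesting one: the same person may hold both roles, but only with the role separation documented, so the structure survives the dual hat.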

Our SOC 2 Type II for the AI systems closed in nine days. The auditor's first comment was that they had never seen a model card structure that current. The harness code didn't impress them; the documentation did. We invested in the governance posture from week one and recouped that investment ten times over in the first audit cycle.

— CISO, fintech client with regulated AI deployment

Frequently asked

What governance artifacts do AI auditors expect?

Eight to twelve documented, version-controlled artifacts. Eval harness documentation, eval policy, model cards per production system, decision log, incident playbook, monitoring dashboards, capacity plan, and a roles document at minimum. Regulator-specific overlays add artifacts but don't change the base structure. Each artifact is read first; the code and configurations come second. The governance posture is the audit posture.

What is a model card and why does it matter?

A per-system document covering purpose, model and version, training and evaluation data composition, eval results across suites, known limitations, intended-use restrictions, responsible owner, and change history. The format originated in academic AI ethics work and has been adopted by FDA SaMD guidance, NIST AI RMF, and OCC SR 11-7-style frameworks. It's the answer to most first-round audit questions about a specific system.

Why is a decision log critical?

Because institutional memory persists in the log when personnel turn over. The log records every governance forum decision with date, rationale, dissent, and follow-on actions. The next governance chair reads it and understands the current architecture's why. Auditors read it as evidence of disciplined governance. Without the log, knowledge leaks every transition and audits become exercises in retroactive rationalization.

What does continuous monitoring evidence look like?

Production-comparison eval-suite scores, drift signals, anomaly detection on quality and cost, incident events with response actions, all over a configurable time range (typically 90 days for audits). Auditors get the queryable system, not a screenshot. Per-system, per-tenant, per-category breakdowns are accessible. The monitoring stack is architected with the audit use case in mind from day one.

How do regulator-specific requirements layer on?

As overlays on a base governance posture. FDA SaMD adds clinical validation, IEC 62304, risk classification. OCC SR 11-7 adds model-risk-management framing. DoD assurance adds CAS controls and explainability. SOC 2 adds trust services criteria. The base posture is designed to absorb overlays gracefully — model card structure accepts additional fields, eval policy accepts additional validation requirements. Designing for one regulator and retrofitting is more expensive than designing for the envelope.

What is the structural-independence requirement and why does it matter?

The team that builds the model cannot validate it; the eval owner reports to a different leadership chain than the model owner. SR 11-7, NIST AI RMF, and similar guidance require this independence. Even at smaller organizations where one person wears multiple hats, the artifact discipline preserves structural independence — the eval owner role produces eval artifacts separately from model owner artifacts. Designing this in from week one avoids the most common audit finding when it's missing.