A custom build that actually ships: the eight-to-twenty-four-week playbook
Most custom AI projects fail in the gap between demo and production
We have audited a lot of stalled AI projects. The pattern is consistent: a six-week prototype produced a compelling demo, the team celebrated, and then six months disappeared into 'productionizing' that never finished. The demo was 80% of the value at 20% of the cost; the remaining 80% of cost — evals, observability, error handling, ops handover, security review — was scoped as 'a few weeks of cleanup.' It never is.
Our four-phase playbook treats the production hardening phase as a peer of the prototype phase, with its own scope, budget, and gates. Discovery sets up prototype. Prototype proves the approach. Hardening makes it operate. Transfer makes it yours. Skip any of them and you ship a demo.
Phase one: discovery is two to four weeks of architecture, not slideware
Discovery is not a meeting series. It is two to four weeks of senior engineers in your environment, mapping the problem, the data, the constraints, and the success criteria. The deliverable is a written architecture document and a prototype scope, both signed off before phase two begins.
Discovery deliverables
- Problem definition with measurable success criteria — not 'improve customer experience' but 'reduce mean time to resolution from 14 minutes to under 4.' (A machine-checkable version is sketched after this list.)
- Data inventory with schema, freshness, access path, and quality assessment.
- Architecture document with model selection rationale, retrieval strategy, evaluation approach, and integration points.
- Prototype scope with explicit in-scope, out-of-scope, and a definition of done.
- Risk register with the failure modes that would kill the deployment and the mitigations.
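The success criteria in the first deliverable are worth capturing as data, not prose, because the same thresholds become the gates the evaluation harness checks in phases two and three. A minimal sketch of what that can look like; the class, names, and numbers here are illustrative, not a prescribed format:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SuccessCriterion:
    """One measurable target from the discovery document."""
    name: str
    baseline: float  # where the metric stands today
    target: float    # where it must land to pass the phase gate
    unit: str

    def met(self, measured: float) -> bool:
        # This example criterion is lower-is-better (resolution time),
        # so the gate passes when the measurement is at or under target.
        return measured <= self.target

# The example criterion from the deliverable above.
mttr = SuccessCriterion(
    name="mean_time_to_resolution",
    baseline=14.0,
    target=4.0,
    unit="minutes",
)

assert not mttr.met(14.0)  # today's baseline fails the gate
assert mttr.met(3.5)       # the discovery target passes
```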
Phase two: prototype proves the approach against real data
Prototype is three to six weeks of building a working system end-to-end against real data, with real evaluation. Not a notebook, not a slide. A running system that the customer's own team can use, with measurable outputs against the success criteria from discovery.
The cardinal rule is no synthetic data. If we cannot get production data into a sandbox, we get production-equivalent data, but the prototype runs against real distributions. Prototypes built on synthetic data ship demos that fall apart on contact with reality. We have seen this pattern enough times to refuse to ship one.
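One way to make the no-synthetic-data rule concrete is to profile the sandbox extract before prototyping starts, because the issues that sink demos (null rates, length skew, encoding noise) surface in a few lines of analysis. A minimal sketch using pandas; the column name and file path are hypothetical:

```python
import pandas as pd

def profile(df: pd.DataFrame, text_col: str) -> dict:
    """Surface the real-data properties that synthetic data scrubs out."""
    text = df[text_col].astype("string")
    lengths = text.str.len()
    return {
        "rows": len(df),
        "null_rate": float(df[text_col].isna().mean()),
        "duplicate_rate": float(df.duplicated().mean()),
        # Length skew: synthetic corpora cluster near the mean, while
        # production text has a long tail that stresses context budgets.
        "p50_chars": float(lengths.quantile(0.50)),
        "p99_chars": float(lengths.quantile(0.99)),
        # U+FFFD replacement characters flag upstream encoding damage.
        "mojibake_rate": float(
            text.str.contains("\ufffd", regex=False).fillna(False).mean()
        ),
    }

# Usage against a production-equivalent extract (path is hypothetical):
# print(profile(pd.read_parquet("sandbox/tickets.parquet"), "body"))
```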
- Engagement length: 8–24 weeks, phase one through phase four
- Team size: 3–8 senior applied AI engineers
- Pricing: fixed-scope per phase, not T&M
- Handover: runbook-complete, with a 90-day support tail
Phase three: hardening is where production AI is actually built
Hardening is three to eight weeks of turning the working prototype into a system that operates. This is the phase that demos skip and production cannot. It is also the phase that decides whether the engagement results in a system your team owns or a science project.
What hardening covers
- Evaluation harness running in CI against a curated test set, with regression detection (a minimal gate is sketched after this list).
- Observability — every model call, every retrieval, every decision logged with reasoning trace.
- Error handling — fallback paths, retry with backoff, graceful degradation when the model is unavailable (see the backoff sketch after this list).
- Security review — auth, secrets, data handling, prompt injection mitigations, jailbreak resistance.
- Cost ceilings and rate limits per tenant or use case.
- Load testing at expected and peak traffic, with documented capacity.
- Runbooks — how to investigate a bad output, how to roll back a model change, how to add a new evaluation.
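Two of those bullets are concrete enough to sketch. First, the evaluation gate: a check that fails the build when quality regresses past a tolerance against a stored baseline. Everything here (the baseline file name, the scoring callable, the 2-point tolerance) is illustrative rather than a prescribed stack:

```python
import json
import sys
from typing import Callable

REGRESSION_TOLERANCE = 0.02  # fail the build if quality drops more than 2 points

def mean_score(cases: list[dict], score: Callable[[dict], float]) -> float:
    """Run the curated test set through the system and average the scores."""
    return sum(score(case) for case in cases) / len(cases)

def gate(current: float, baseline_path: str = "eval_baseline.json") -> None:
    """Exit non-zero, failing the CI job, when quality regresses."""
    with open(baseline_path) as f:
        baseline = json.load(f)["mean_score"]
    if current < baseline - REGRESSION_TOLERANCE:
        sys.exit(f"eval regression: {current:.3f} vs baseline {baseline:.3f}")
    print(f"evals ok: {current:.3f} (baseline {baseline:.3f})")
```

Second, the error-handling bullet reduced to its skeleton: retry transient failures with jittered exponential backoff, then degrade gracefully instead of surfacing an error to the user. The exception type and the two callables are placeholders for whatever model client the engagement actually uses:

```python
import random
import time
from typing import Callable

class TransientModelError(Exception):
    """Placeholder for the client's timeout / 5xx-style failures."""

def call_with_fallback(
    prompt: str,
    call_model: Callable[[str], str],
    fallback: Callable[[str], str],
    attempts: int = 3,
) -> str:
    """Retry with backoff, then fall back (e.g. to a cached or
    rule-based answer) rather than letting the failure reach the user."""
    for attempt in range(attempts):
        try:
            return call_model(prompt)
        except TransientModelError:
            if attempt == attempts - 1:
                break
            # Sleep 1s, 2s, 4s... plus jitter so retries do not synchronize.
            time.sleep(2 ** attempt + random.random())
    return fallback(prompt)
```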
Phase four: transfer is what makes the engagement actually end
Transfer is two to six weeks of pairing with the customer's team to ensure they can operate, extend, and evolve the system without us. We do not consider an engagement complete until the customer's engineers have shipped a non-trivial change to the system on their own.
The deliverables are documentation — architecture decision records, runbooks, evaluation playbooks — and live working sessions. The Custom AI Software product comes with a 90-day support tail after transfer, but that is a backstop, not a dependency. The system has to operate without us calling in.
Pricing is fixed-scope per phase, not time-and-materials
We price each phase as a fixed scope with named deliverables and named gates. Discovery is the smallest commitment — typically two to four weeks at a known fee — and either of us can decline to proceed to phase two with no obligation. This avoids the worst pattern in custom-software engagements, where time-and-materials projects drift because nobody wants to call out scope creep.
Total engagement cost varies by complexity but typically lands in the $250K–$1.4M range across all four phases. The number that matters more than the sticker is the unit cost: a system that is operating, that your team owns, that has a documented runbook, costs less per year of useful life than a stalled project that never shipped. As an illustration, a $700K build that operates for five years works out to $140K per year of useful life; a $300K prototype that never ships has no years to divide by.
Team composition is senior, small, and stable
Engagements run with three to eight senior applied-AI engineers, led by a tech lead with shipped production AI experience. The same people from week one to week twenty-four. We do not sell you partners and then staff the work with junior engineers managed by a project manager. The work is too dependent on judgment for that to produce results.
Customer commitment matters too. The engagement requires a named technical owner on your side, with calendar time committed weekly, and access to the engineers who actually operate the systems we are integrating with. Engagements where the customer's team is too busy to be involved are the ones that end up as documentation handovers nobody reads.
What you receive at the end
A production system deployed in your environment, runbooks for your SREs, architecture decision records, evaluation playbooks, and a 90-day support tail. Specific commercial terms — including any source, data, or model-weight rights — are scoped per engagement. Full visibility into how the system works starts in week one, not at the closeout meeting.
The transfer phase is where most engagements I have worked on quietly fall apart. The vendor disappears, the docs are 60% of what you need, and three months later you are paying for an emergency engagement to keep it running. This was the first one where we actually owned what we got. The runbook had a section on the failure mode that hit us in month four. We followed it.
— VP Engineering, healthcare client
When the playbook stretches to 24 weeks instead of 8
The eight-week version is for narrow, well-scoped problems with clean data and a single integration point. The twenty-four-week version is for systems that touch six or more upstream systems, require fine-tuned models, operate in regulated environments, or need extensive evaluation harness build-out. The phasing is the same; the duration of each phase scales with complexity.
Frequently asked
How long does a custom AI engagement actually take?
Eight to twenty-four weeks across four phases: discovery (2–4 weeks), prototype (3–6 weeks), hardening (3–8 weeks), and transfer (2–6 weeks). Narrow, well-scoped problems land at the short end. Systems touching six or more upstream integrations, fine-tuned models, or regulated environments stretch toward the longer end. The phasing stays constant; the duration scales with complexity.
Why does prototype need real data?
Because prototypes built on synthetic data produce demos that fall apart on contact with reality. Real data has skew, edge cases, encoding inconsistencies, and quality issues that synthetic data scrubs out. A prototype that handles real data poorly is a prototype that needs more work; a prototype that handles synthetic data well is a prototype that has not yet met the actual problem.
What does 'hardening' include in a custom AI engagement?
Hardening covers everything that turns a working prototype into a system that operates: evaluation harnesses in CI with regression detection, full observability with reasoning traces, error handling and fallback paths, security review including prompt injection mitigations, cost ceilings, load testing, and runbooks. It is the phase that demos skip and production cannot. Skipping it produces stalled deployments.
What is delivered at handover in a custom AI engagement?
A production system deployed in your environment, with runbooks, architecture decision records, evaluation playbooks, and the documentation your team needs to operate it. Specific commercial terms — including any source, configuration, or model-weight rights — are scoped per engagement. After the 90-day support tail, the system runs day to day without us calling in.
How much does a custom AI engagement cost?
Engagements typically land between $250K and $1.4M across all four phases, with discovery alone running in the $40K–$120K range as the lowest commitment to get started. Pricing is fixed-scope per phase rather than time-and-materials, which keeps scope and cost honest. The unit cost that matters is dollars per year of useful operating life, not sticker price at signing.
Why fixed-scope per phase instead of time-and-materials?
Because time-and-materials engagements drift. Without phase gates, scope creep is invisible until the budget is gone. Fixed-scope-per-phase forces an explicit conversation at every gate about what was learned, what changes, and whether to proceed. Either party can decline to proceed to the next phase with no obligation, which keeps incentives aligned with outcomes rather than billable hours.
What is the 90-day support tail after transfer?
After the transfer phase ends, our team is available for 90 days to handle incidents, answer questions, and review changes the customer's team makes to the system. It is a backstop, not a dependency — the system has to operate without us calling in. The tail exists because surprises surface in the first three months of production operation, and we want to be available when they do.
More from Field Notes
- Fine-tuning vs RAG vs prompt engineering: when each actually wins. An honest decision tree for fine-tuning, RAG, and prompt engineering — what each does well, what each costs, and how to choose without religion.
- Eval harnesses in CI: what to measure for custom AI systems. Continuous evaluation for custom AI systems — quality, safety, regression, latency, and cost evals that block bad releases before customers see them.
- Safety, red-team, and the failure modes specific to your domain. How to red-team a custom AI system for the failure modes that matter in your domain — beyond generic jailbreaks, into the harm patterns specific to the work.