Part 2 of a series. The first post diagnosed the intent deficit – the gap between what language models can produce when appropriately directed and what they deliver under normal operating conditions. This post describes the reasoning architecture we have built to address it. The third post draws out the implications for public-sector capacity building. The fourth post shows the mechanisms described here in concrete paired cases.
The problem, as the previous post set it out, is not that language models lack capability. The capability is present; the conditions of default generation do not elicit it. The model that produces audience-agnostic prose when left to generate freely recalibrates substantially when asked to construct a specific audience model first. The model that introduces citation errors verifies accurately when auditing is made an explicit workflow step. What is missing from default operation is not knowledge but the reasoning discipline that would convert knowledge into work calibrated to a specific purpose, audience, and context.
The question this post answers is what a reasoning architecture that supplies that discipline looks like. The architecture described here has been built through iterative deployment on substantive knowledge work – policy analysis, institutional documentation, design work, research synthesis – in the range of conditions where the intent deficit is most visible. It is deployable as a single context document loaded at session start, with optional extensions in environments that support filesystem persistence. Rather than a prompt in the ordinary sense, or a productivity technique, it is a set of mechanisms addressing distinct sub-problems within the larger challenge of keeping reasoning from purpose active across a long session.
Mechanically, this architecture is not a novel technical system. What it requires, at a minimum, is a skill file – a structured document containing the reasoning framework, the per-decision questions, and the protocol-level checks – loaded into the AI interface as context at the start of a session. Major AI interfaces, including claude.ai, ChatGPT, and the Claude Code CLI, already support this either directly or through minor configuration. The mechanism is architecturally simple: a document the model reads at session start that shapes how it approaches the work. What makes the architecture substantive is not the mechanism but what the skill file contains – the specific reasoning patterns, the protocol structures, and, critically, the accompanying failure catalogues that encode accumulated observation of how work goes wrong in specific domains.
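To make the mechanism concrete, here is a minimal sketch of loading a skill file as session context, using the Anthropic Python SDK as one example of an interface that accepts a context document at session start. The file path and model identifier are placeholders, not part of the architecture; the same pattern applies in any tool that takes a system or project document.

```python
# Minimal sketch: load a skill file as the governing context for a
# session. The path and model name are placeholders; any interface
# that accepts a system or context document at session start works.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("skills/knowledge-work.md") as f:  # hypothetical skill file
    skill = f.read()

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model identifier
    max_tokens=4096,
    system=skill,               # the skill file shapes every subsequent decision
    messages=[{"role": "user", "content": "Draft the policy brief described in my notes."}],
)
print(response.content[0].text)
```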
This simplicity matters for who can deploy the architecture and under what conditions. A skill file is a document, and so is a failure catalogue; an institution with a working AI deployment and the discipline to maintain these documents as living artifacts has, at the level of technical infrastructure, everything it needs. The architecture does not require a proprietary platform, specialised tooling beyond what is already available, or developer access to model internals. What it requires is the harder thing: the observational work of building the documents, and the institutional discipline to keep them current as new failure modes surface and existing ones are resolved. Accumulated institutional knowledge – the specific patterns of how things go wrong in this ministry's document formats, or this organisation's audiences, or this kind of analysis – is what converts a generic skill file into a document that produces disproportionate quality improvements. The architecture's power scales with the specificity of what has been catalogued, which is a function of attention rather than of engineering.
The architecture comprises six functional groupings, organised by the sub-problem each addresses.
The first and most important mechanism operates before any substantive generation. Its purpose is to ensure that the model has an interrogated understanding of the task's purpose, audience, and context – not an assumed one – before it commits to any output.
At the centre of this stage is a reasoning pattern we call the three theories of mind. The pattern does the same work that a senior practitioner does implicitly when a junior colleague asks for direction: it constructs three specific profiles whose intersection determines every subsequent decision.
The first profile is of the assigner – the person or institution requesting the work, understood at the level of what they are actually trying to accomplish by producing this deliverable. Not the task description, which is often a local instruction, but the underlying goal: what success looks like from the assigner's position, what failure would look like, what is at stake in how the deliverable is received. An assigner profile that reproduces the task description is not a profile. It is a restatement.
The second profile is of the ideal practitioner – the expert who would produce this work at the highest level if given the same brief. What does their training look like? What would they know that is relevant to this task? What are the quality signals they would attend to? What errors, specifically, would they recognise on sight as unacceptable? The purpose of this profile is to provide a reference point against which draft work can be evaluated. A choice that the ideal practitioner would not make needs affirmative justification; it cannot be left as default.
The third profile is of the audience – the specific person or group who will actually encounter and use the deliverable. Not an abstract category ("officials," "readers"), but a specific profile: what do they already know, what do they assume, what are they under pressure from, how much attention will they actually have, what would cause them to disengage, what would cause them to reject the work regardless of its recommendations. Where the deliverable has multiple audiences – a senior decision-maker who decides and a junior staffer who implements – both must be modelled, and the tension between what each needs must be resolved deliberately rather than collapsed into a composite reader who is actually no one.
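A sketch of the three profiles as explicit structures may make the pattern concrete. The field names here are illustrative, not part of the architecture; what matters is that each profile is a set of specific, disputable answers rather than a restatement of the brief.

```python
# Illustrative sketch of the three theories of mind as structured
# profiles. All field names are hypothetical; the operative property
# is specificity, not the particular schema.
from dataclasses import dataclass

@dataclass
class AssignerProfile:
    underlying_goal: str            # what the assigner is actually trying to accomplish
    success_looks_like: str
    failure_looks_like: str
    stakes: str                     # what rides on how the deliverable is received

@dataclass
class PractitionerProfile:
    relevant_expertise: str         # what the ideal expert brings to this task
    quality_signals: list[str]      # what they would attend to in a strong draft
    unacceptable_errors: list[str]  # what they would recognise on sight as wrong

@dataclass
class AudienceProfile:
    who: str                        # a specific person or group, not a category
    already_knows: list[str]
    assumptions: list[str]
    attention_budget: str           # how much attention they will actually have
    disengagement_triggers: list[str]
    rejection_triggers: list[str]   # what sinks the work regardless of its merits
```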
Two further steps follow the three profiles before the framing check is produced. The first is an explicit move from output-as-end to output-as-means: the deliverable is not the goal; the deliverable is the means by which a goal is accomplished. A proposal is not the goal; the goal is the funder's support. A training deck is not the goal; the goal is that the audience leaves equipped to do the thing the training is preparing them for. A policy brief is not the goal; the goal is that the decision-maker moves in the direction the evidence warrants. Treating the output as the end – drafting "a good proposal" – produces work calibrated to conventions of the genre. Treating the output as the means – drafting the proposal that would secure this specific funder's support from this specific position – produces work calibrated to the goal the deliverable is there to serve. This is the step at which the assigner profile, properly constructed, produces a reframing of the task. It is also the step most reliably skipped by default generation, because the task description is usually phrased in output-as-end terms and pattern-matching to output-as-end terms is cheap.
The second further step is deriving decisions from the profiles: before any drafting, the profiles and the means-clarified goal are used to produce explicit decisions about what the deliverable will do. The content list comes first – both what to include and, often the more consequential choice, what to exclude, because the audience profile often dictates content removal before content creation, especially where the audience already knows material that default generation would include. Then the structural choices that will serve this audience given their specific working conditions, the register and tone that the intersection of assigner, expert, and audience profiles requires, and a named set of failure modes the deliverable must avoid – the specific things that would cause the audience to reject the work regardless of its other qualities. Finally, the approach principles and the element-level typology, specified in enough detail that execution does not require re-derivation. These derived decisions sit between the profiles and the framing check as a substantive intermediate artifact; they are what turn the profiles from an exercise into a working foundation.
The three profiles, the output-as-means clarification, and the derived decisions together produce what we call a framing document – a short written artifact, produced before substantive work begins, that surfaces the specific judgement calls being made. The framing document is not ceremony. Its self-test is whether it contains anything the user could plausibly push back on. "I understand you want a brief on topic X for audience Y" fails the test: it restates what the user already said. A framing document that passes the test contains disputable claims – the audience is being modelled as skimming senior officials rather than as careful technical readers, and if the audience is actually careful technical readers the structure will need to change; the deliverable is being treated as an argument rather than as an analytical overview, and if the intent is to present options for evaluation the tone will need to shift; a specific ambiguity in the brief has been resolved one way with the resolution made visible. The framing document is an opportunity for the user to correct the model's interpretation of the task before hours of work go into an interpretation the user did not intend.
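As a sketch, and assuming the profile structures above, the framing document and its self-test might be represented like this. Encoding "disputable" as a claim paired with a consequence-if-wrong is an illustrative simplification, not a prescribed schema.

```python
# Sketch of a framing document as a structured artifact. A restatement
# carries no consequence if it turns out wrong; a genuine framing claim
# does, which is what makes it something the user can push back on.
from dataclasses import dataclass

@dataclass
class FramingClaim:
    claim: str     # e.g. "audience modelled as skimming senior officials"
    if_wrong: str  # e.g. "restructure for careful technical reading"

@dataclass
class FramingDocument:
    goal_as_means: str                 # what the deliverable is the means to
    claims: list[FramingClaim]
    derived_decisions: dict[str, str]  # content, structure, register, failure modes

    def passes_self_test(self) -> bool:
        # Every claim must carry a concrete consequence if wrong.
        return bool(self.claims) and all(
            c.claim.strip() and c.if_wrong.strip() for c in self.claims
        )
```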
This pattern is identifiable in experienced human practitioners. A senior professional briefed by a client reflects back not the task description but their interpretation of it – "I'm treating this as an argument for a specific response rather than as an analytical comparison of options, because the decision-maker will commit rather than evaluate; correct me if that's wrong." The framing check structures this reflection and makes it a required gate. Work does not proceed until the framing is confirmed or corrected. This is what experienced practitioners do implicitly and what the architecture requires explicitly.
Framing established before work begins is necessary but not sufficient. As the session proceeds, individual sentences, elements, structural choices, and code patterns accumulate. At each of these decision points, the model faces a continuous choice: reason from the established purpose, or take the path of least cognitive resistance. The mechanisms in this grouping keep that choice active at the element level throughout execution.
"The path of least cognitive resistance" is not a loose phrase here; it points at something the architecture as a whole is designed around. The behavioural-science literature on dual-process cognition, developed most influentially by Kahneman and Tversky and popularised in Kahneman's Thinking, Fast and Slow, distinguishes between System 1 – fast, automatic, pattern-matching, effortless – and System 2 – slow, deliberate, analytical, effortful. System 1 is what is running most of the time. System 2 is engaged only when something triggers it, and even when engaged, it is cognitively costly and therefore reluctant to stay engaged for long. The standard picture of skilled human performance is not System 2 thinking at every decision, which would be exhausting and impossible to sustain; it is System 1 patterns trained, over time, to produce the outcomes that a deliberative System 2 would endorse if it were checking. The experienced practitioner's judgement is, in this framing, not continuous System 2 effort but System 1 patterns that have been shaped by enough prior System 2 reflection to produce reliably good defaults.
The habit-formation literature describes the same phenomenon from the other direction. Habits research, from Wood and Neal onward, treats habits as automatic responses cued by context, developed through repetition under conditions that reinforce the specific behaviour the habit is supposed to produce. Habit change does not work by relying on willpower to override automatic responses – willpower is limited and erodes under load. It works by redesigning the context so that the automatic response is the one that serves the goal. Cognitive-behavioural therapy describes a related mechanism: the work of therapy is often not to suppress automatic patterns but to develop new automatic patterns that replace them, through deliberate practice that eventually becomes default.
The reframing this produces for the reasoning architecture is useful. The question is not how to keep the model running in effortful deliberative mode at every decision, which would be neither possible nor desirable even if the model's architecture supported the distinction. It is how to structure the context such that the default behaviour – the path the model will take when not actively reflecting – is the path that leads to good outcomes. This is the design problem the architecture is addressing, and each of the six groupings plays a specific role in it. The pre-work framing of grouping 1 is priming: it shapes the default trajectory of subsequent generation before the session has accumulated the content that will otherwise pull it toward genre averages. Per-decision checkpoints, grouping 2, function as triggers: a small explicit question at each significant decision that activates reflection long enough to catch a default before it ships, without requiring continuous deliberative effort between decisions. Grouping 3's persistence mechanisms maintain that priming against the forces that would erode it, so that the defaults established at the start remain the defaults at session hour four. The domain references of grouping 4 calibrate what those defaults should be for the specific genre being operated in. Grouping 5 – the mechanical process gates – acknowledges that some failures are not reliably caught by any amount of attention and require checks that run regardless. Grouping 6's reasoning visibility closes the loop by making the defaults visible enough to the user that they can correct the ones that are wrong, which over time improves the priming the next session starts from.
What the architecture is not trying to do is force the model into perpetual System 2. That approach would fail structurally, for the same reason continuous vigilance fails in human practitioners: the load is too high to sustain, and the costliness of deliberative checking eventually produces its own error mode – skipped checks, exhausted attention, the eventual collapse back into whatever defaults the system has when attention lapses. The architecture is instead trying to build a context in which the path of least resistance is, more often than it would otherwise be, a good one – and in which the specific decisions where reflection is actually needed are surfaced by design rather than left to whether the model happens to notice them.
The core tool is the per-decision checkpoint. For every significant element – every sentence with analytical weight, every design element with visual weight, every structural choice, every code abstraction – the minimum requirement is two questions that cannot be answered by rote: what is this specific element trying to accomplish within the larger deliverable, and am I choosing this because it is the best option, or because it is the default? Both questions require brief genuine reasoning about the element in question. If the answer to the second is "default," the decision needs to be re-examined; genuine evaluation of alternatives is the required next step.
For decisions at higher scale – overall structure, argument order, tone, framing, any choice that shapes how the entire deliverable lands – a longer checkpoint applies: what this element is trying to accomplish; what the realistic alternatives are; which alternative best serves the purpose, audience, and image established in the framing; how the intended audience would perceive this choice; whether it coheres with the other decisions made so far; whether the specific context – the size, position, medium, surroundings – changes what the right answer is; and whether the choice is being made because it is the best option or because it is a default.
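A sketch of the two checkpoint tiers, with wording condensed from the paragraphs above. In practice these run as reasoning steps rather than code; the structure is shown only to make the tiers and the re-examination rule concrete.

```python
# Illustrative encoding of the two checkpoint tiers. The question
# wording is condensed from the prose; none of this is executable
# policy, only a way of making the structure explicit.
ELEMENT_CHECKPOINT = [
    "What is this element trying to accomplish within the larger deliverable?",
    "Am I choosing this because it is the best option, or because it is the default?",
]

STRUCTURAL_CHECKPOINT = [
    "What is this element trying to accomplish?",
    "What are the realistic alternatives?",
    "Which alternative best serves the purpose, audience, and image in the framing?",
    "How would the intended audience perceive this choice?",
    "Does it cohere with the other decisions made so far?",
    "Does the specific context (size, position, medium, surroundings) change the answer?",
    "Is this the best option, or a default?",
]

def requires_reexamination(answer_to_last_question: str) -> bool:
    # An honest answer of "default" re-opens the decision: alternatives
    # are genuinely evaluated before work proceeds.
    return answer_to_last_question.strip().lower() == "default"
```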
The sixth question deserves explicit attention, because it addresses a failure mode that the previous post in this series identified as shared between AI-generated work and the work of inexperienced human practitioners. Both tend to apply context-agnostic descriptions of elements ("use a bar chart for categorical data," "extract helper functions when code is reused") without reference to the specific context of the task at hand. The same chart type that serves a full-page data feature damages a sidebar because it is unreadable at that scale. The same code abstraction that improves maintainability in production software obscures the data flow in a research script whose readers are following logic linearly. What distinguishes the experienced practitioner is not that they have more rules but that their rules come pre-conditioned on context in a way that junior practitioners' and language models' rules do not. The explicit context question is how that conditioning is forced for a generator that would otherwise apply rules context-free.
Per-decision checkpoints operate at the moment of each decision. But as a session grows – as tokens accumulate, as tool outputs fill the context window, as draft sections proliferate, as compaction eventually drops what was established at the start – the reasoning framework that was established at session start is at risk of degrading. The mechanisms in this grouping address the structural problem of context attenuation directly.
The first is the element typology, produced at the moment the framing is finalised and before execution begins. The typology is a pre-derived design system for the specific project: not type selections in the abstract, but full visual and structural specifications for each class of element the deliverable will contain, derived from the theories of mind and the understanding of what the deliverable needs to accomplish. For a website, this covers the hero section (what it communicates, what it must not do, what visual treatment serves the purpose), data callouts (the test each must pass to be included, what makes a number worth foregrounding for this audience), navigation structure (the logic by which users find what they need), and for each dataset, the full visual treatment – not just "a chart" but what kind of chart, what colour palette and why, what typography and scale, what annotations and why. For a document, the typology covers heading treatments, table styles, callout conventions, visualisation treatments.
The critical property of the typology is that the decisions it records were made at the moment when the theories of mind and the framing were at full weight, before any execution had begun and any context had attenuated. When execution reaches a particular element three hours later, the decision is already made – not re-derived from scratch under conditions where the original reasoning is weakest, but looked up from a persisted record of reasoning done when the conditions were strongest. The typology also prevents cross-element inconsistency, because every element has been pre-specified rather than independently pattern-matched.
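A sketch of what typology entries might look like for two element classes; the field names and specifications are illustrative. The operative property is that execution looks decisions up rather than re-deriving them.

```python
# Illustrative typology entries for a data-heavy deliverable. The
# specifications are hypothetical examples of the level of detail the
# typology records, not recommended defaults.
TYPOLOGY = {
    "data_callout": {
        "inclusion_test": "the number changes what the reader does next",
        "visual": "large numeral, one-line caption, single accent colour",
        "must_not": ["foreground activity metrics as if they were impact"],
    },
    "bar_chart": {
        "palette": "two-tone, colourblind-safe",
        "typography": "axis labels at body size, no rotated text",
        "annotations": "direct labels on bars; no legend under five series",
        "rationale": "the audience skims; the chart must carry its point unaided",
    },
}

def spec_for(element_class: str) -> dict:
    # Execution looks the decision up rather than re-deriving it under
    # attenuated context three hours into the session.
    return TYPOLOGY[element_class]
```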
The second is the scratchpad, a running log updated throughout the project. Its purpose is to prevent user corrections from being acknowledged in the moment and silently re-violated in subsequent decisions. Four kinds of entries belong in the scratchpad. Feedback from the user, logged with enough context to be useful later – "the palette reads as amateurish" is less useful than a note specifying what about the palette was amateurish and what direction the correction pointed toward. Superseded decisions – when earlier decisions are overridden by later feedback, the old entry is marked superseded and the new direction recorded, so the model does not revert to the rejected approach when the context window has degraded. The model's own mid-session realisations – when the model notices during execution that an earlier decision was wrong or that a pattern is not serving the purpose, logging it prevents the mistake from recurring. And principles to watch for – if a correction reveals a broader pattern (the user consistently dislikes template-driven choices; the audience requires a more formal register than the model defaulted to), the general principle is noted so it can be applied proactively to decisions the user has not yet reviewed.
In filesystem environments, the typology and scratchpad are written to disk as persistent artifacts and re-read before each major decision. In chat interfaces without filesystem access, they are produced visibly in the conversation and re-stated at natural breakpoints in the session. Either way, their function is the same: they hold the reasoning framework at full weight independently of how much context has accumulated since the session started.
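In filesystem environments, the scratchpad can be as simple as an append-only log. A minimal sketch follows, with the four entry kinds above; the path and schema are illustrative.

```python
# Minimal sketch of a scratchpad as an append-only log on disk, with
# the four entry kinds described in the prose. Path and schema are
# hypothetical; what matters is that entries persist and are re-read.
import json
from datetime import datetime, timezone

SCRATCHPAD = "scratchpad.jsonl"  # hypothetical path
ENTRY_KINDS = {"feedback", "superseded", "realisation", "principle"}

def log_entry(kind: str, note: str, supersedes: str | None = None) -> None:
    assert kind in ENTRY_KINDS
    entry = {
        "time": datetime.now(timezone.utc).isoformat(),
        "kind": kind,
        "note": note,              # specific enough to be useful hours later
        "supersedes": supersedes,  # the earlier decision this overrides, if any
    }
    with open(SCRATCHPAD, "a") as f:
        f.write(json.dumps(entry) + "\n")

def reread() -> list[dict]:
    # Run before each major decision: holds corrections at full weight
    # regardless of how much conversation context has accumulated since.
    with open(SCRATCHPAD) as f:
        return [json.loads(line) for line in f]
```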
A general framework for reasoning from purpose is not sufficient, because different domains of knowledge work have different quality signals, different failure modes, and different calibration requirements. What a policy brief needs structurally is not the same as what a website needs. What a design language signals to a government audience is not what the same language signals to a tech audience. What "good code" means in a research script is not what it means in production software.
The architecture addresses this through domain reference libraries – structured documents, one per domain, each built from observation of what goes wrong in that domain specifically and what the quality calibration requires. A reference on persuasive communications addresses how proposals and advocacy work should calibrate tone for different kinds of audiences, what differentiation by substance rather than by attack looks like, how moral force can be made load-bearing without being performed. A reference on documents and reports addresses front-loaded structure, executive summary thresholds, the design language signals that establish institutional credibility with different professional readerships. A reference on data and evidence addresses when to highlight numbers, what makes a datum worth prominent placement, how to distinguish activity metrics from impact claims. A reference on websites and institutional UI addresses the difference between institutional design language and startup-SaaS defaults, how navigation should reflect what users actually need to find, what specific failures – the stat callout showing a number the story does not need, the testimonial carousel whose testimonials do not earn their place – to watch for.
Each reference is consulted before substantive work in that domain begins. The reference supplies what the general framework cannot: the domain-specific knowledge of what "best available choice" means in the genre, and the specific failures that domain is especially prone to. Without domain references, the general framework maintains reasoning discipline but gives that reasoning no specific content to reason from. With them, the reasoning is calibrated to the actual quality standards of the genre the deliverable is being produced within.
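Mechanically, consultation can be as simple as loading the matching reference into context before work in the domain begins. A sketch, with hypothetical paths and domain labels:

```python
# Illustrative domain-reference lookup. The paths and labels are
# hypothetical; an institution's actual library would reflect the
# domains it works in and the failure catalogues it has accumulated.
REFERENCES = {
    "persuasive": "refs/persuasive-communications.md",
    "report":     "refs/documents-and-reports.md",
    "evidence":   "refs/data-and-evidence.md",
    "web":        "refs/websites-and-institutional-ui.md",
}

def load_reference(domain: str) -> str:
    # Consulted before substantive work in the domain begins: supplies
    # what "best available choice" means in this genre, and the specific
    # failures the genre is prone to.
    with open(REFERENCES[domain]) as f:
        return f.read()
```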
A property of the reference architecture that matters in practice: the references are extensible and are intended to be extended. A reference built from repeated observation of what goes wrong in a specific domain, in a specific institution, with specific kinds of audiences, is more valuable than any generic guide, because it captures the particular ways work actually breaks in that particular context. Institutions that maintain domain references as living documents, updated from each round of correction, build an accumulating asset whose quality is a function of how disciplined the correction loop has been rather than of how well any initial version was written.
Some failure modes are not addressed adequately by reasoning discipline alone, even when reasoning discipline is active. Three in particular require mechanical process gates: citation integrity, visual output quality, and content preservation during format conversion.
The first gate is the evidence ledger and claim audit. When the model makes an assertion that it attributes to a source, the source is logged to a structured list at the moment of attribution, along with the specific claim being made and the scope being asserted. Before delivery, every entry in the ledger is audited against the source it cites – not against whether the source exists, which is the weak form of citation checking that a search can accomplish, but against whether the source actually supports the specific claim at the specific scope being asserted. This is a stronger check than standard verification, and it catches the specific and dangerous failure mode that standard verification does not: the real citation to a real author on an adjacent topic, whose actual work does not say what the text claims. The ledger makes this check mandatory rather than optional, and produces a written record that the audit occurred. When the audit reveals a mismatch, the claim is weakened or cut rather than silently preserved with a citation that does not support it.
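A sketch of the ledger as a structure; the audit itself is a reading step performed against each source, and the structure's job is only to make that step mandatory and recorded. Names are illustrative.

```python
# Illustrative evidence ledger: claims are logged at the moment of
# attribution and must be audited against their sources before the
# work can be delivered. The schema is hypothetical.
from dataclasses import dataclass, field

@dataclass
class LedgerEntry:
    claim: str    # the specific assertion made in the text
    source: str   # the work it is attributed to
    scope: str    # how far the claim is being stretched
    audited: bool = False
    supported: bool | None = None  # set during the pre-delivery audit

@dataclass
class EvidenceLedger:
    entries: list[LedgerEntry] = field(default_factory=list)

    def log(self, claim: str, source: str, scope: str) -> None:
        # Logged at the moment of attribution, not reconstructed later.
        self.entries.append(LedgerEntry(claim, source, scope))

    def ready_for_delivery(self) -> bool:
        # True only once every entry has been read against its source
        # and every unsupported claim has been weakened or cut.
        return all(e.audited and e.supported for e in self.entries)
```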
The second gate is visual QA. When the deliverable is visual – a document with layout, a slide deck, a website, a visualisation – the model does not treat the generation as complete when the code or markup has been written. The rendered output must be produced, inspected as an image, and compared against the typology and the purpose before the work is considered done. Rendering is not optional or deferred: it is a required step. Inspection is an active pass – overall composition, major elements, all edges for clipping or overflow, transitions between pages for consistency, text legibility at the intended display size, image resolution at render size, alignment between related elements. Specific failure modes are looked for: text clipping at containers or page boundaries, overlapping elements, inconsistent margins across pages, orphaned headings separated from their content, mobile failures. Issues are fixed, re-rendered, and re-inspected until the output is clean.
The value of a failure catalogue at this level is that it moves from abstract awareness to specific checks. A catalogue that says "watch for footer overlap on content-dense slides" is more useful than one that says "check layouts are clean," but a catalogue that specifies the dimensional check – for a slide deck of standard dimensions, if a table begins at vertical position y and has n rows of height h, the table ends at y + nh, and must clear the footer at its stated position – is more useful still, because it gives the model a specific calculation to run rather than a visual judgement to make. The accumulation of this kind of specificity is what converts a generic skill file into an institutional asset. Each specific failure mode that has been observed, characterised precisely, and added to the catalogue reduces that class of failure in subsequent work. Over time, a catalogue built from repeated correction becomes more valuable to an institution than any generic guide, because what it captures is the specific ways work actually goes wrong in that institution's specific contexts.
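The dimensional check described above is small enough to state directly. A sketch, with illustrative units (points, on a 720-point-tall slide with a footer position taken from the catalogue):

```python
# The footer-clearance check from the catalogue, as a calculation the
# model can run instead of a visual judgement. Units are illustrative.
def table_clears_footer(y: float, n_rows: int, row_height: float,
                        footer_y: float) -> bool:
    # A table beginning at vertical position y with n rows of height h
    # ends at y + n*h, and must end above the footer's stated position.
    return y + n_rows * row_height < footer_y

# e.g. with a footer at 680pt: a table at y=320 with nine 36pt rows
# ends at 644 and clears; a tenth row ends at 680 and fails the check.
assert table_clears_footer(320, 9, 36, 680)
assert not table_clears_footer(320, 10, 36, 680)
```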
The principle underlying visual QA is that the rendered image is the authoritative representation of what the document actually communicates to its reader, and code-level inspection – even careful code-level inspection – is reviewing an abstraction of that representation rather than the representation itself. A table with rows that overflow into the footer is visible in the rendered image; it may not be obvious in the code that produced it. A chart whose colours are unreadable to a user with common forms of colour vision deficiency shows up in the render; it does not show up in the source. The same principle applies in reverse when extracting information from documents that use visual conventions – PDFs with highlighted elements, forms with colour-coded fields, designed front pages with embedded metadata. Rendering the relevant pages as images and extracting from what the image shows is more reliable than extracting from the markup, because the conventions are encoded in the visual layer rather than in the structural layer.
The third gate is content preservation during format changes. When the task is to convert a document – from one format to another, from one length to another, from one register to another – the model must treat the task as transformation of existing text rather than as regeneration. A content fingerprint is captured from the source before conversion begins: a structured record of what claims, data, examples, and specific wording the source contains. After conversion, the output is compared against the fingerprint to verify that the content has survived the format change. Where substantive content has been lost or altered, it is either restored or the loss is flagged explicitly rather than silently accepted. This addresses the specific failure mode – silent rewriting during format conversion – that is among the most dangerous of the six discussed in the previous post, because it produces output that looks like a faithful conversion but is not.
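A minimal sketch of the fingerprint-and-verify step follows. A real fingerprint would track claims, examples, and specific wording with more structure; numbers are used here because they are among the content most often silently dropped.

```python
# Illustrative content fingerprint for format conversion: capture the
# source's substantive content before converting, then verify that it
# survived. Numbers only, as the simplest demonstrable case.
import re

def fingerprint(text: str) -> set[str]:
    # Matches integers, decimals, thousands-separated figures, and
    # percentages; a real fingerprint would cover far more than this.
    return set(re.findall(r"\d[\d,]*(?:\.\d+)?%?", text))

def lost_in_conversion(source: str, converted: str) -> set[str]:
    # Whatever is returned must be restored or explicitly flagged,
    # never silently accepted.
    return fingerprint(source) - fingerprint(converted)

source_text = "Enrolment rose 12% to 4,380 students in 2023."
converted_text = "Enrolment rose to 4,380 students in 2023."
assert lost_in_conversion(source_text, converted_text) == {"12%"}
```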
The final mechanism is a configuration and behavioural rule rather than a reasoning protocol. It is also the mechanism that compounds the value of every other mechanism in the architecture.
The rule is that the model reasons as though the user can see the reasoning – because in most deployment environments the user can see it, if the interface exposes reasoning traces and the model treats them as part of the work product rather than as internal scratch work. Reasoning visibility serves three functions. It provides a second line of defence against drift: when the model's reasoning departs from the task's purpose, the user can see the departure before it becomes output and intervene while correction is cheap. It builds the user's understanding of how the model is thinking about the task in a way that makes their corrections more precise over time. And it establishes a correction loop that improves the quality of subsequent sessions, because the patterns the user corrects become knowledge the user carries into how they set up the next task.
None of these three functions operates when reasoning is hidden. A model that produces polished output without exposing the reasoning that produced it places the entire quality burden on output-level review. Reasoning-level review is cheaper and catches failures earlier. Making it available – through configuration flags that enable thinking traces, through the explicit practice of sharing reasoning in conversation, through the production of written framing and typology artifacts – converts the model from a generation engine that the user edits after the fact into a reasoning partner that the user can steer in real time.
The capacity-building implications of this rule, and specifically what it means for contexts where experienced practitioners are the structurally scarce resource, are developed in the third post of this series.
The six groupings above describe what the architecture does. An equally important question is where the mechanisms live – because language model behaviour during generation is governed not by a single instruction set but by a hierarchy of context sources, each with different properties of persistence, authority, and reach. Understanding the hierarchy matters both for understanding why the architecture is structured the way it is, and for understanding its limits relative to what deployment-level or training-level interventions could achieve.
There are five levels, in descending order of binding strength.
System prompt. The highest level. Present at every generation call, not diluted by session length, takes precedence when in conflict with other context. Practitioners working with general-purpose AI interfaces typically do not have access to this level. Developers deploying AI tools do. Rules that live here are rules that hold.
Project configuration. Automatically loaded at session start, present throughout. Less durable than system prompt – attenuates in sufficiently long sessions – but substantially more durable than instructions given in conversation. Requires a tool that supports persistent project configuration; available in development environments, not in chat interfaces.
Skill and protocol files. The primary level at which this architecture operates for most practitioners. A skill file loaded at session start supplies the full reasoning framework in a single document. It sits below project configuration in automatic persistence, but it is the level available in any interface – a practitioner can load the architecture as conversation context at the start of a session in any tool, making it active without system-level access.
Filesystem artifacts. In a specific sense the most persistent layer. Files on disk – framing documents, typologies, scratchpads, evidence ledgers – exist beyond the reach of context degradation and can be re-read at any decision point, regardless of what has happened to the context window since they were written. A rule stated once in conversation can be outweighed by subsequent content; the same rule written to a file and re-read before each major decision is available at full weight whenever it is consulted. Requires a tool with filesystem access.
Conversation history. The weakest binding level for persistent rules. Instructions given in conversation are subject to the nearest-instruction override: as the session progresses and context fills, earlier instructions lose influence over the current generation. The audience model established carefully at session start may be producing audience-agnostic prose forty thousand tokens later because its careful establishment, done once, is no longer actively constraining generation in the presence of the immediate sentence being completed.
The problem the hierarchy is designed to address is the attention attenuation described in the previous post. More proximate content exerts stronger influence over the current completion than more distal content. This is not a bug in the mechanism; it is how attention works. A model given a thorough audience model at session start will, by session midpoint, be generating primarily under the influence of the most recent tool outputs, draft sections, and user messages. The audience model is still in the context window. It is simply losing the weight competition.
Two consequences follow. First, any purely instruction-based approach to maintaining reasoning discipline will degrade as the session grows, regardless of how well the instructions were written. "Reason carefully about the audience" is salient in the first few hundred tokens and outweighed by session midpoint. The failure is structural, not attributable to the prompt.
Second, the architecture's distribution of mechanisms across binding levels is a response to this structural property. The framing check produces a written artifact specifically so the audience model has persistence beyond its original statement in conversation. The typology materialises design reasoning at its strongest moment so it is available as look-up rather than re-derivation when execution reaches each element. The scratchpad exists to re-surface corrections that the conversation history has buried. The instruction to re-anchor on the framing document before major decisions exists precisely because the influence of that document is not automatic; it must be actively reinstated. Each of these is a workaround for operating at a binding level where the rule does not hold by itself.
This also defines the architecture's ceiling. If the same principles were implemented at system-prompt level – by AI developers building knowledge work tools with the reasoning architecture embedded in the governing context – most of the compensating overhead would become unnecessary. The framing check would not need to produce a written artifact for persistence; the audience model would be always-present at full weight. Domain references would be automatically loaded based on task classification rather than requiring the practitioner to invoke them. The per-decision checkpoint would not need to be reinstated after compaction; it would never have attenuated. The discipline the scaffold works to maintain through multiple reinforcing mechanisms would, at stronger binding levels, be a property of the operating environment rather than of the practitioner's session setup.
This is worth being explicit about, because it bears on how AI developers should think about building for substantive knowledge work. The architecture described here is not the maximum of what is possible. It is the maximum of what is available to practitioners working outside the system-prompt level, which for most of the world is the level that actually matters, because most people using language models for serious work are not in positions to modify system prompts. What the architecture demonstrates is that reasoning discipline can be imposed externally even at this level; what the architecture does not demonstrate is the absence of room for AI developers to do substantially more by embedding the same mechanisms at stronger binding.
Some failures the architecture manages rather than eliminates. The training distribution's pull toward genre defaults is the root cause of most of the failures in the previous post, and instructions – even at system-prompt level – work by adding a constraint on top of that pull rather than by changing the pull itself. When the constraint is strong and proximate, it holds. When conditions weaken it, the pull reasserts.
Training-level interventions – instruction fine-tuning, reinforcement learning against knowledge work quality signals, constitutional approaches that embed the earn-its-place test as a value rather than a rule – could in principle address the root. A model trained to reason from purpose and audience at every decision, rather than trained on a corpus of knowledge work outputs and then instructed to reason from purpose and audience, would not need the compensating mechanisms to the same degree. The mechanisms would still be useful – the framing check, the evidence ledger, the visual QA gate add value at any baseline – but they would be adding precision to already-better defaults rather than compensating for defaults that actively pull against quality.
This matters for how the research agenda around knowledge work quality should be framed. The failure modes catalogued in the previous post are not edge cases; they are the normal output of capable models operating on the normal tasks practitioners perform. No existing capability benchmark measures this class of failure, because benchmarks measure capability under controlled conditions rather than delivered output under deployment conditions. The work of building evaluation instruments that capture the gap between capability and delivered output is, in our view, a central part of what is needed for the next stage of model development to address what practitioners actually encounter rather than what controlled tests select for.
Each of the six groupings above addresses a failure mode the others leave open. An architecture that implements only the pre-work framing (grouping 1) loses that framing's effect by session midpoint as context fills. An architecture with long-session persistence mechanisms (grouping 3) but no domain calibration (grouping 4) maintains generic reasoning discipline where genre-specific calibration was needed. Strong reasoning discipline across groupings 1 to 4 without forced verification at high-risk failure points (grouping 5) still produces citation errors and layout failures that reasoning alone does not catch. Reasoning visibility (grouping 6) without the upstream discipline produces transparent but poor reasoning. The mechanisms compound because each compensates for what the others cannot cover.
The architecture also scales with task size. For short, trivial tasks – a paragraph edit, a quick reformulation, a single factual question – the full apparatus would be overhead. The judgement it encodes can be applied informally. For substantive knowledge work tasks – a policy brief, a full analysis, a design project, a research synthesis – the apparatus earns its cost: the hour of framing, typology, and scratchpad setup is smaller than the cost of a full revision round after delivery to a user who catches the problems the framework would have caught.
What the architecture does not do is remove the need for the practitioner's judgement. It requires more of that judgement in specific places – in confirming or correcting the framing, in reviewing the typology, in intervening when reasoning visibility reveals drift – and less in others – in catching failures that the mechanical gates now catch automatically. The overall effect is to redistribute where the practitioner's attention is most needed, toward the points where human judgement is load-bearing and away from the points where mechanical process is sufficient. This matters especially for the capacity-building argument developed in the next post: an architecture that externalises reasoning for correction creates conditions that are structurally different from an architecture that produces finished outputs for review, with implications for what working alongside AI develops in the practitioner rather than substitutes for.