Part 1 of a series on the structural problem of AI in knowledge work. Part 2 describes a reasoning architecture for addressing it. Part 3 draws out the implications for public-sector capacity building. Part 4 shows both the failure modes and the corrections in concrete paired cases.
A language model asked to draft an institutional assessment, a programme implementation plan, or a technical evaluation report will produce something that looks like what was asked for. It will have the expected sections, cite sources, use the framings and terminology the genre conventionally uses, and arrive at recommendations or findings presented in structurally sensible form. Read carelessly, it appears to be competent professional work. Read carefully against what the document is actually supposed to accomplish – for whom, in what context, with what constraints – and a particular class of failure surfaces with considerable regularity. The structure is present. The reasoning that would justify why this structure, these arguments, this evidentiary emphasis, this tone – for this specific institutional reader, making this specific decision, against this specific set of prior assumptions – is not.
The claim is not that language models produce nonsense; they do not. The problem is narrower and more interesting. Models deployed on knowledge work tasks exhibit a consistent gap between what they can do when appropriately directed and what they deliver under normal operating conditions. A model that introduces citation errors when left to generate freely will accurately verify citations against source material when auditing is required as an explicit step. A model that produces audience-agnostic prose will substantially recalibrate its register, structure, and depth when asked to construct a specific audience model before generating. A model that defaults to genre-typical visual elements will interrogate those choices when element-level reasoning is required rather than assumed. The capability is present in each case; what varies is whether the conditions of the task elicit it.
We call the underlying pattern the intent deficit. The model generates by pattern-completion on its training distribution's representation of what the output type looks like, rather than by reasoning from what this specific output needs to accomplish for this specific audience in this specific context. The most common result is work that matches the surface form of competent practice while failing on the dimensions that actually determine whether the work functions.
This post diagnoses that failure – where it comes from, where it shows up, and why it is not the kind of problem that prompt-level fixes resolve. Subsequent posts describe the reasoning architecture we have built to address it, and draw out the implications for public-sector capacity building in contexts where experienced practitioners are scarce. The discussion below assumes some familiarity with how language models actually work – next-token prediction, statistical pattern-matching rather than comprehension, the way RLHF shapes model behaviour, hallucination as a property of generation rather than an error mode. Readers new to those ideas should read our intro to LLMs for policymakers first; the mechanisms discussed below depend on them.
Before looking at the AI case specifically, it is worth recognising that the class of failure we are describing is not new. It is, in fact, the characteristic failure mode of junior knowledge-work practitioners across every professional domain. A junior policy researcher produces a brief by implicitly asking what policy briefs look like and generating something that matches that pattern. A junior designer produces a presentation by asking what presentations look like. A junior analyst writes a report structured the way reports are structured. The surface conventions are absorbed by exposure; the underlying reasoning about why those conventions exist, which of them serve the purpose in a given case and which do not, is not.
What distinguishes the experienced practitioner is accumulated judgement – a compression of years of prior work into how they now operate. The senior policy writer does not use an executive summary because policy briefs have executive summaries; they use one because this specific document, for this specific time-pressed decision-maker, benefits from a compressed overview that permits a read/no-read judgement in thirty seconds. For the senior designer, a testimonial carousel is not a thing institutional websites have and therefore a thing their site should have too; it earns its place only if social proof in that specific form serves this specific audience's decision process. A senior analyst factors a helper function out not because extracting helpers is what good code does but because, in this particular codebase, the abstraction makes the logic more tractable for the people who will read it next.
These practitioners are running a test implicitly, on every significant decision, that their junior colleagues are not yet running. The test is roughly: does this specific element earn its place as the best available choice for this specific purpose, audience, and context, or is it here because it is a default? The test is what experience compresses. It is the difference between competent practice and work that looks like competent practice.
The language model has no such accumulated test, and this is not a metaphor: in the absence of architectural provision for it, the model reverts to its training distribution's statistical average on every decision it faces. It is doing, at industrial speed, exactly what a junior practitioner does when they have not yet internalised the reasoning that their senior colleagues run automatically. The failure modes are structurally identical because the underlying cognitive property is structurally identical: both are generating by pattern-completion on the distribution of outputs they have seen, rather than by reasoning from the specific conditions of the task in front of them.
This parallel matters because it tells us what kind of problem this is: a familiar knowledge-work failure that has always existed in human practice, which AI deployment has made industrial-scale and systematic. And it tells us what kind of solution would work: something structurally analogous to what mentorship and deliberate practice do for a human practitioner. The rest of this series develops that response in detail.
The most visible symptom is what many people have called default-to-genre: an executive summary appearing because policy briefs have executive summaries, a three-column layout appearing because institutional websites have three-column layouts, a bar chart appearing because the data is categorical. These symptoms are real, but framing the problem as "the model uses default patterns" is narrower than what the deficit actually covers. The same underlying process – decision without active reasoning from purpose, audience, or context – produces failures that look nothing like genre defaults.
Six archetypal failure modes, each of which we see repeatedly in AI-assisted knowledge work:
Register inheritance. The model generates a deliverable carrying the emotional register of its immediate session context rather than the register the external audience requires. A proposal drafted after a frank internal conversation about a funding competitor reproduces the comparative framing of that conversation in language that will reach the funder. The internal context was appropriate for internal use; the output will be read by someone who was not party to it and has no reason to share its frame. The failure is not that the model misunderstood; the failure is that it did not reason about the audience shift. The most proximate context saturated generation and carried through.
Over-literal instruction-latching. A user gives thoughtful and specific instructions: be thorough, use prose rather than bullets, explain in depth, show the relationships between ideas clearly. The model latches onto the instructions and produces a dense, carefully-argued document whose operational content – the decisions being proposed, the actions being recommended, the conclusions being reached – is buried several pages in, only reachable through sustained reading. The instruction was followed. The goal – a document whose reader, often a senior decision-maker with minutes rather than hours to give it, can act on without reading in full – was abandoned. The failure mode is the model treating the nearest-stated instruction as the operative goal, rather than using it as an input to reasoning about what would actually serve the underlying purpose.
Evidence substitution by evaluative language. A finding is asserted, and the work of convincing the reader that the finding is significant is done by adjectives – "striking," "compelling," "substantial," "transformative" – rather than by the numbers, the scope, or the concrete consequence the finding carries. The adjectives arrive in place of the evidence that would earn them. This is the model performing confidence rather than conveying it.
Plausible citation. A claim is followed by a citation to a real author on an adjacent topic whose actual work does not support the specific claim being made at the scope being asserted. The model has produced the pattern of "assertion followed by supporting citation" without running the check that the citation actually supports the assertion. The most dangerous version of this failure is not the entirely fabricated reference, which a quick search detects. It is the real reference that does not say what the text claims it says – a failure that survives any verification short of fetching the cited source and reading the relevant passage.
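A minimal mechanical version of the missing check – our illustrative sketch, not a technique this post prescribes – is to require every citation to carry a verbatim supporting quote, and to verify that the quote actually appears in the fetched source text. This catches quote fabrication cheaply; the subtler scope mismatch still needs someone, or something, actually reading the passage:

```python
def quote_appears_in_source(quote: str, source_text: str) -> bool:
    """Return True if the supporting quote appears verbatim in the source.

    `source_text` stands in for the fetched document. Case and whitespace
    are normalised so formatting differences do not cause false negatives.
    """
    norm = lambda s: " ".join(s.lower().split())
    return norm(quote) in norm(source_text)
```

With a hypothetical source reading "the intervention reduced dropout by 4.2 percentage points", a quote asserting that reduction passes the check, while a quote claiming dropout was "eliminated entirely" fails it – forcing the mismatch into view before the citation ships.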
Audience-unaware design. A pedagogical tool for children encountering a concept for the first time is designed as if the user already knows what to attend to and what to ignore. A decomposition visualisation renders bars of equal width regardless of magnitude, breaking the pedagogical point it is meant to make. A number-line shows both numbers already placed, turning what should be a visible transformation into a static statement. Each is a failure of the audience model: the tool was designed without a specific user in view, and so defaulted to an abstract "learner" who already knows the things the tool is supposed to teach.
Content loss during format conversion. A user asks for a document converted to another format. The model treats the task as regeneration rather than transformation. What returns is a document with the requested format and much of the original content silently rewritten, restructured, or abbreviated. No flag is raised; the output looks like what was requested. The failure mode is specific and dangerous because the user, having received a plausibly-formatted file, may not check whether the content of the original survived.
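A mechanical guard for this failure mode can be sketched in a few lines – this is our illustration of the idea, not a tool the post describes. It strips formatting from both documents and measures what fraction of the original's content tokens survive the conversion, flagging silent abbreviation:

```python
import re
from collections import Counter

def content_survival_ratio(original: str, converted: str) -> float:
    """Fraction of the original's content tokens present in the converted
    output, ignoring case, punctuation, and markup. A crude multiset
    overlap, not an ordering-aware diff - but enough to flag silent loss.
    """
    tokenise = lambda text: re.findall(r"[a-z0-9']+", text.lower())
    orig, conv = Counter(tokenise(original)), Counter(tokenise(converted))
    if not orig:
        return 1.0
    surviving = sum((orig & conv).values())
    return surviving / sum(orig.values())
```

A conversion that merely reformats scores near 1.0; one that silently drops or rewrites paragraphs scores visibly lower, which is the signal to diff the two documents before accepting the file.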
These six patterns are domain-shaped but share a structural cause. None of them is a capability failure in the sense that the model could not have produced better output. They are all failures of reasoning about purpose before generation. Each would have been caught by a reasoning step that was not run: an explicit audience model, an explicit check that the instruction serves the goal rather than displacing it, an explicit test of whether the citation supports the claim, an explicit content-preservation step during transformation. The capability was present. The conditions of the task did not require it to be exercised.
It is tempting to treat these failures as local bugs – the result of imperfect training, solvable by better data, better fine-tuning, or a more carefully worded prompt. This is not wrong, exactly, but it is incomplete. The failures are symptoms of how pattern-completion works as a generation strategy, interacting with how attention distributes its weight across a context window that fills up as a session progresses. Understanding both mechanisms is necessary for seeing why prompt-level responses do not scale.
The first mechanism: pattern-completion on a training distribution. A language model has been exposed to an enormous volume of text, including enormous volumes of knowledge work outputs – policy documents, research papers, reports, essays, slide decks, code. From these, it has learned not the reasoning processes that produced any particular document, but the statistical regularities of what such documents look like. When it is asked to produce a new one, it completes the pattern it has absorbed. This is why the output looks so convincing. It is also why the output is unreliable on dimensions that require reasoning about the specific task at hand rather than about what tasks of this type typically look like.
The training distribution is also not neutral. It skews toward the most common patterns, which in knowledge work means the patterns of competently-executed but not exceptional practice – because competently-executed, templatable, genre-conventional work is massively overrepresented in what has been digitised and indexed. The model has learned what policy briefs usually look like, where "usually" means an average over a large population of briefs, most of which are themselves pattern-matched to genre conventions rather than being products of deep purposive reasoning. In any specific case where the right answer departs from the average – where this audience needs something different, where this context changes what a good choice would be – the model's prior is pulling toward the average and away from the specific correct answer. This is the strongest structural property of how generation works, and no amount of better prompting overcomes it.
The second mechanism: context attenuation under growing session length. Attention, the mechanism by which a transformer model decides which parts of its context to weight most heavily when generating the next token, does not weight uniformly. In practice, more proximate content – the most recent sentences, the most recent tool outputs, the most recent instructions, the paragraph being extended – exerts stronger influence than content further back in the context window, partly through learned recency biases and partly through sheer dilution: a fixed instruction is an ever-smaller fraction of a growing context. The consequences for a long knowledge work session are severe.
The implication is severe for any instruction-based approach to maintaining reasoning discipline. "Reason carefully from the audience's perspective" is a salient instruction in the first few hundred tokens of a session. Fifty thousand tokens later, after the model has processed source documents, drafted sections, responded to user feedback, executed tool calls, and produced intermediate outputs, that instruction has not been forgotten. It has been outweighed. The attention weight now flows primarily to proximate content – the sentence being completed, the tool output just returned, the user's most recent message – not to the careful instruction that set up the session. If the careful audience reasoning was done once at the start and is no longer actively shaping individual decisions, the model has reverted to generating under the influence of proximate context, which is generally the content of the immediate task rather than the audience model.
This is why we see the intent deficit most visibly in long, substantive knowledge work sessions rather than in short interactions. In a two-turn exchange – a question followed by an answer – the instruction and the generation are close enough that the instruction carries full weight. In a four-hour session producing a serious deliverable, the instructions established at session start are competing for attention with everything that has accumulated since. They lose the competition not because the model has ceased to understand them but because the mechanism by which they influence generation is structurally designed to prioritise proximate content.
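The dilution argument can be made concrete with a toy model. The sketch below is an illustration with made-up numbers, not a measurement of any real transformer: it gives each token a recency-biased weight and asks what share of total attention mass a 50-token instruction at the start of the session still commands as the context grows.

```python
def instruction_share(context_len: int, instr_len: int = 50,
                      decay: float = 0.9995) -> float:
    """Share of total attention mass on an instruction occupying the oldest
    `instr_len` positions, under a toy exponential recency bias.
    decay=1.0 recovers uniform attention, where pure dilution still applies.
    """
    weights = [decay ** d for d in range(context_len)]  # d=0 is the newest token
    oldest = weights[context_len - instr_len:]          # the session-start instruction
    return sum(oldest) / sum(weights)
```

With these assumed numbers, the instruction holds roughly nine percent of the attention mass in a 500-token exchange and effectively none of it at 50,000 tokens; even with decay set to 1.0 – no recency bias at all – its share falls from 10% to 0.1% by dilution alone.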
Together these two properties – pattern-completion pulling toward genre averages, attention attenuation weakening instructions that were not reinstated – produce the intent deficit as a predictable deployment-level phenomenon. It appears specifically in the conditions of real knowledge work: sessions long enough that initial instructions attenuate, tasks substantive enough that generation is not trivial, stakes high enough that the gap between the genre average and the correct specific answer matters. It does not appear in the short, controlled conditions under which capability benchmarks measure model performance. A model that scores at the frontier on standardised benchmarks can simultaneously produce the full catalogue of failures above in deployment – not because these results contradict each other but because the benchmark measures capability while the deployment conditions are what expose whether that capability is being exercised.
Prompt engineering has produced a real toolkit for specific quality dimensions. Chain-of-thought prompting improves step-by-step reasoning on tasks that benefit from decomposition. Few-shot examples calibrate format and register effectively. Role conditioning – "you are a senior policy analyst writing for…" – meaningfully shifts model output. Self-consistency and verification loops catch certain classes of error. These are all real contributions, and none of them is what we are contesting.
What we are contesting is that any of them addresses the class of failure we are describing, which operates structurally across long sessions rather than at a single turn. Prompt engineering techniques are one-time conditioning interventions at context entry. They reshape the model's generation at the moment the prompt is being processed. They do not architecturally ensure that reasoning persists across a full session of substantive work, because the mechanism they act through – placing constraints in context – is subject to the same attention-weight decay as any other context-level intervention.
The failure mode we are describing is therefore not one that any individual prompt can prevent. It is not a failure of any particular decision; it emerges across a long chain of decisions, each locally plausible, which collectively drift from what the task actually needed. No prompt at session start prevents the drift. Better prompts produce better starting conditions. They do not produce mechanisms for maintaining those conditions as the session fills.
This also explains why users who have worked extensively with language models often report that they have learned to get good results through some combination of skill, taste, and repeated correction – without being able to reduce what they have learned to a set of rules that a less experienced user could follow. What they have actually learned is a running discipline of identifying drift and intervening to correct it: noticing when the output is missing the audience, when a recommendation is buried, when a claim needs a verification step, when the conversion has dropped content. The discipline is continuous and distributed across the whole session. It is not encoded in any single prompt; it is encoded in the user's attention and reflexes. That discipline is exactly what the architecture described in the next post tries to encode structurally, so that it does not depend on the user having independently developed it.
There is a broader reason to take this class of failure seriously, beyond the immediate effects on knowledge work output quality. The pattern – surface form adopted in the absence of the reasoning that would connect form to function – is not confined to AI. It is among the most persistent findings in the state capability research literature, where it has been documented and studied under the name isomorphic mimicry.
Andrews, Pritchett, and Woolcock, in their work on the capability trap, use isomorphic mimicry to describe a pattern in which governments adopt the institutional forms of capable states – the organisational charts, the designated procedures, the named reform policies – without developing the underlying function those forms require to work. The forms are present; the function they are supposed to enable is not. The pattern is not an occasional implementation failure. It is, in their analysis, the characteristic failure mode of state capability development in contexts where technical knowledge is abundant but the judgement required to translate that knowledge into effective situated action is scarce.
The structural analogy to the session-level intent deficit is more than metaphor. Both operate by pattern-completion on surface representations in the absence of reasoning from underlying function. A ministry that adopts a modern procurement procedure without the administrative judgement to apply it in context is producing the form of capable procurement without the function. A model that produces a policy brief with the expected structure without the underlying reasoning about why this specific audience needs this specific structure is doing the same thing at a different scale.
Whether these are the same mechanism operating at different scales, or structurally analogous mechanisms with a shared formal description, is a question the theoretical literature would have to resolve. What matters for practice is that the diagnostic is the same and the architectural response has the same shape. In both cases, the surface form is cheap; the reasoning from function is the scarce thing. In both cases, adding more forms – more policy templates at institutional scale, better prompts at session scale – does not close the gap, because the gap is not between having forms and lacking forms. It is between form and the reasoning that would make the form do what it is supposed to do.
This framing matters for how we understand what AI does and does not contribute to knowledge work in institutional contexts. A default-deployed language model is exceptionally good at producing the surface forms of competent knowledge work. It is unreliable, in the specific ways the six failure modes describe, at supplying the reasoning from function that makes those forms actually work. An organisation that deploys AI tools without mechanisms for maintaining reasoning discipline is not getting a scalable substitute for experienced practitioners. It is getting form production at scale without the function. For contexts where experienced practitioners are abundant and AI is supplementary, this may be tolerable – the practitioners supply the function, and the AI accelerates the form. For contexts where experienced practitioners are scarce, the same deployment pattern risks reproducing isomorphic mimicry at industrial speed. The third post in this series develops this implication in detail.
The rest of this series describes a response – not a prompt, and not a productivity technique, but an architecture: a structured set of mechanisms for keeping reasoning active across the full length of a substantive knowledge work session, distributed across multiple levels of the context that governs model behaviour so that no single level bears the full load of maintaining discipline. The mechanisms include specific reasoning patterns that establish purpose and audience before generation begins, per-decision checkpoints that prevent the accumulation of defaults, persistent artifacts that survive context attenuation, domain references that calibrate to specific genre requirements, and mechanical checks at the workflow stages where reasoning alone has been empirically shown to fail.
The architecture operates at a level that practitioners working with general-purpose AI interfaces can deploy without developer access – loaded as a context document at session start, optionally extended with filesystem artifacts in environments that support them. We do not claim it is the only response possible to the intent deficit, or a complete response. Training-level interventions would address the underlying distribution pull in ways that no session-level architecture can. But session-level architecture is what is available to practitioners and institutions now, working with the models that actually exist, and what we have built is an attempt to make that level do as much as it can.
The next post describes the architecture itself, mechanism by mechanism, including the theoretical basis for each and the failure mode it addresses. The one after that turns to the capacity-building implications – what becomes possible when a scaffold designed for reasoning discipline is also designed to externalise reasoning in ways that transfer to the practitioners working alongside it, and what that might mean for developing-country and public-sector contexts where the shortage of experienced practitioners is the binding constraint.