The isomorphic mimicry problem in AI outputs

A language model asked to draft an institutional assessment, a programme implementation plan, or a technical evaluation report will produce something that looks like what was asked for. It will have the expected sections, cite sources, use the framings and terminology the genre conventionally uses, and arrive at recommendations or findings presented in structurally sensible form. Read carelessly, it appears to be competent professional work. Read carefully against what the document is actually supposed to accomplish – for whom, in what context, with what constraints – and a particular class of failure surfaces with considerable regularity. The structure is present. The reasoning that would justify why this structure, these arguments, this evidentiary emphasis, this tone – for this specific institutional reader, making this specific decision, against this specific set of prior assumptions – is not.

The problem is narrower and more interesting than language models producing nonsense. Models deployed on knowledge work tasks exhibit a consistent gap between what they can do when appropriately directed and what they deliver under normal operating conditions: a model that introduces citation errors when left to generate freely will accurately verify citations against source material when auditing is required as an explicit step; a model that produces audience-agnostic prose will substantially recalibrate its register, structure, and depth when asked to construct a specific audience model before generating; and a model that defaults to genre-typical visual elements will interrogate those choices when element-level reasoning is required rather than assumed. The capability is present in each case; what varies is whether the conditions of the task elicit it.

We call the underlying pattern the intent deficit. The model generates by pattern-completion on its training distribution's representation of what the output type looks like, rather than by reasoning from what this specific output needs to accomplish for this specific audience in this specific context. The result, at its most common, is work that matches the surface form of competent practice while failing on the dimensions that actually determine whether the work functions.

This post diagnoses that failure – where it comes from, where it shows up, and why it is not the kind of problem that prompt-level fixes resolve. Subsequent posts describe the reasoning architecture we have built to address it, and draw out the implications for public-sector capacity building in contexts where experienced practitioners are scarce. The discussion below assumes some familiarity with how language models actually work – next-token prediction, statistical pattern-matching rather than comprehension, the way RLHF shapes model behaviour, hallucination as a property of generation rather than an error mode. Readers new to those ideas should read our intro to LLMs for policymakers first; the mechanisms discussed below depend on them.

The human parallel: how inexperienced practitioners fail

The class of failure we are describing is not new – it is the characteristic failure mode of junior knowledge-work practitioners across every professional domain. A junior policy researcher produces a brief by implicitly asking what policy briefs look like and generating something that matches that pattern; a junior designer produces a presentation by asking what presentations look like; a junior analyst writes a report structured the way reports are structured. The surface conventions are absorbed by exposure; the underlying reasoning about why those conventions exist, and which of them serve the purpose in a given case, is not.

What distinguishes the experienced practitioner is accumulated judgement – a compression of years of prior work into how they now operate. The senior policy writer does not use an executive summary because policy briefs have executive summaries; they use one because this specific document, for this specific time-pressed decision-maker, benefits from a compressed overview that permits a read/no-read judgement in thirty seconds. A testimonial carousel, for the senior designer, earns its place only if social proof in that specific form serves this specific audience's decision process – it is not simply a feature institutional websites typically have. The same logic holds for a senior analyst refactoring a helper function: the abstraction earns its place in this particular codebase because it makes the logic more tractable for the people who will read it next, not because extracting helpers is what good code does.

These practitioners are running a test implicitly, on every significant decision, that their junior colleagues are not yet running. The test is roughly: does this specific element earn its place as the best available choice for this specific purpose, audience, and context, or is it here because it is a default? The test is what experience compresses. It is the difference between competent practice and work that looks like competent practice.

The language model has no such accumulated test, and this is not a metaphor: in the absence of architectural provision for it, the model reverts to its training distribution's statistical average on every decision it faces. It is doing, at industrial speed, exactly what a junior practitioner does when they have not yet internalised the reasoning that their senior colleagues run automatically. The failure modes are structurally identical because the underlying cognitive property is structurally identical: both are generating by pattern-completion on the distribution of outputs they have seen, rather than by reasoning from the specific conditions of the task in front of them.

The parallel identifies what kind of problem this is: a familiar knowledge-work failure that has always existed in human practice, which AI deployment has made industrial-scale and systematic, and whose solution has the same shape as what mentorship and deliberate practice do for a human practitioner. The rest of this series develops that response in detail.

Where the deficit shows up

The most visible symptom is what many people have called default-to-genre: an executive summary appearing because policy briefs have executive summaries, a three-column layout appearing because institutional websites have three-column layouts, a bar chart appearing because the data is categorical. Framing the problem as "the model uses default patterns" is narrower than what the deficit actually covers, because the same underlying process – decision without active reasoning from purpose, audience, or context – produces failures that look nothing like genre defaults.

Six archetypal failure modes recur with enough regularity in AI-assisted knowledge work that each deserves individual naming.

Register inheritance. The model generates a deliverable carrying the emotional register of its immediate session context rather than the register the external audience requires. A proposal drafted after a frank internal conversation about a funding competitor reproduces the comparative framing of that conversation in language that will reach the funder: the internal context was appropriate for internal use, but the output will be read by someone who was not party to it and has no reason to share its frame. The model did not reason about the audience shift; the most proximate context saturated generation and carried through.

Over-literal instruction-latching. A user gives thoughtful and specific instructions: be thorough, use prose rather than bullets, explain in depth, show the relationships between ideas clearly. The model latches onto the instructions and produces a dense, carefully-argued document whose operational content – the decisions being proposed, the actions being recommended, the conclusions being reached – is buried several pages in, only reachable through sustained reading. The instruction was followed. The goal – a document whose reader, often a senior decision-maker with minutes rather than hours to give it, can act on without reading in full – was abandoned. The failure mode is the model treating the nearest-stated instruction as the operative goal, rather than using it as an input to reasoning about what would actually serve the underlying purpose.

Evidence substitution by evaluative language. A finding is asserted, and the work of convincing the reader that the finding is significant is done by adjectives – "striking," "compelling," "substantial," "transformative" – rather than by the numbers, the scope, or the concrete consequence the finding carries. The adjectives arrive in place of the evidence that would earn them, performing confidence rather than conveying it.

Plausible citation. A claim is followed by a citation to a real author on an adjacent topic whose actual work does not support the specific claim being made at the scope being asserted: the model has produced the pattern of "assertion followed by supporting citation" without running the check that the citation actually supports the assertion. The most dangerous version of this failure is the real reference that does not say what the text claims it says – one that survives any verification short of fetching the cited source and reading the relevant passage, and which is far harder to detect than an entirely fabricated reference that a quick search resolves.

Audience-unaware design. A pedagogical tool for children encountering a concept for the first time is designed as if the user already knows what to attend to and what to ignore. A decomposition visualisation renders bars of equal width regardless of magnitude, breaking the pedagogical point it is meant to make. A number-line shows both numbers already placed, turning what should be a visible transformation into a static statement. Each is a failure of the audience model: the tool was designed without a specific user in view, and so defaulted to an abstract "learner" who already knows the things the tool is supposed to teach.

Content loss during format conversion. A user asks for a document converted to another format. The model treats the task as regeneration rather than transformation, and what returns is a document with the requested format and much of the original content silently rewritten, restructured, or abbreviated. No flag is raised; the output looks like what was requested. The failure is particularly hard to catch because the user, having received a plausibly-formatted file, may not check whether the content of the original survived.

These six patterns are domain-shaped but share a structural cause. None of them is a capability failure in the sense that the model could not have produced better output: all are failures of reasoning about purpose before generation, and each would have been caught by a reasoning step that was not run – an explicit audience model, an explicit check that the instruction serves the goal rather than displacing it, an explicit test of whether the citation supports the claim, an explicit content-preservation step during transformation. The capability was present; the conditions of the task did not require it to be exercised.
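Two of these missing steps are mechanical enough to sketch in code. The Python below is a minimal illustration, not a tool described in this series. It flags sentences that use one of the evaluative adjectives quoted above with no figure alongside, and it lists sentences from an original document that no longer appear in a converted version. The function names, the rough sentence splitter, the rule that any digit counts as evidence, and the sample sentences are all assumptions made for the example.

```python
import re

# Adjectives quoted in the text above; an assumed starting list, not a complete one.
EVALUATIVE = {"striking", "compelling", "substantial", "transformative"}


def split_sentences(text: str) -> list[str]:
    """Rough splitter: break after . ! or ? followed by whitespace, and on newlines."""
    parts = re.split(r"(?<=[.!?])\s+|\n+", text)
    return [p.strip() for p in parts if p.strip()]


def normalise(text: str) -> str:
    """Lowercase, turn punctuation and markup characters into spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())


def unsupported_evaluations(text: str) -> list[str]:
    """Sentences that contain an evaluative adjective but no digit at all."""
    flagged = []
    for sentence in split_sentences(text):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        if words & EVALUATIVE and not re.search(r"\d", sentence):
            flagged.append(sentence)
    return flagged


def missing_sentences(original: str, converted: str) -> list[str]:
    """Original sentences whose normalised wording is absent from the converted text."""
    haystack = f" {normalise(converted)} "
    missing = []
    for sentence in split_sentences(original):
        needle = normalise(sentence)
        if needle and f" {needle} " not in haystack:
            missing.append(sentence)
    return missing


# Made-up sample text, for illustration only.
draft = "The results were striking. Enrolment rose a substantial 14% across 30 districts."
print(unsupported_evaluations(draft))
# ['The results were striking.']

original = "The pilot reached 412 schools. Costs rose 18% in year two."
converted = "<p>The pilot reached 412 schools.</p><p>Costs rose in year two.</p>"
print(missing_sentences(original, converted))
# ['Costs rose 18% in year two.']
```

Neither check judges quality. A flagged sentence may be perfectly justified by a figure in the next sentence, and a legitimately reworded sentence will be reported as missing. The point is only that a check of this kind turns a silent failure into something a person is prompted to look at.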

Why this is a structural property of generation, not an occasional error

It is tempting to treat these failures as local bugs – the result of imperfect training, solvable by better data, better fine-tuning, or a more carefully worded prompt. This is not wrong, exactly, but it is incomplete, because the failures are symptoms of how pattern-completion works as a generation strategy interacting with how attention distributes its weight across a context window that fills as a session progresses. Understanding both mechanisms is necessary for seeing why prompt-level responses do not scale.

The first mechanism: pattern-completion on a training distribution. A language model has been exposed to an enormous volume of text, including enormous volumes of knowledge work outputs – policy documents, research papers, reports, essays, slide decks, code. From these, it has learned not the reasoning processes that produced any particular document, but the statistical regularities of what such documents look like. When it is asked to produce a new one, it completes the pattern it has absorbed. This is why the output looks so convincing. It is also why the output is unreliable on dimensions that require reasoning about the specific task at hand rather than about what tasks of this type typically look like.

The training distribution is not a neutral and balanced map of the language in the way a dictionary or a curriculum can be balanced. Model training requires enormous volumes of text, and what goes in is not carefully curated – it includes whatever text exists at scale, and certain types of text exist at far greater volume than others. The distribution therefore skews toward the most common patterns, which in knowledge work means the patterns of competently executed but unexceptional practice, because competently executed, templatable, genre-conventional work is massively overrepresented in what has been digitised and indexed. The model has learned what policy briefs usually look like, where "usually" means an average over a large population of briefs, most of which are themselves pattern-matched to genre conventions rather than being products of deep purposive reasoning. In any specific case where the right answer departs from the average – where this audience needs something different, where this context changes what a good choice would be – the model's prior pulls toward the average and away from the specific correct answer. This pull is a structural property of how generation works: better prompting can improve the starting conditions, but it does not remove the pull.

The second mechanism: context attenuation under growing session length. Attention, the mechanism by which a transformer model decides which parts of its context to weight most heavily when generating the next token, does not weight uniformly. In practice, more proximate content – the most recent sentences, the most recent tool outputs, the most recent instructions, the paragraph being extended – tends to exert stronger influence than content further back in the context window. This is a regular property of how these models behave rather than a malfunction, and it matters a great deal in a long knowledge work session.

The implication is severe for any instruction-based approach to maintaining reasoning discipline. "Reason carefully from the audience's perspective" is a salient instruction in the first few hundred tokens of a session. Fifty thousand tokens later, after the model has processed source documents, drafted sections, responded to user feedback, executed tool calls, and produced intermediate outputs, that instruction has not been forgotten. It has been outweighed. The attention weight now flows primarily to proximate content – the sentence being completed, the tool output just returned, the user's most recent message – not to the careful instruction that set up the session. If the careful audience reasoning was done once at the start and is no longer actively shaping individual decisions, the model has reverted to generating under the influence of proximate context, which is generally the content of the immediate task rather than the audience model.
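The difference between "forgotten" and "outweighed" can be made concrete with a deliberately simplified calculation. The sketch below is a toy, not a model of any real transformer. It assumes a single softmax over the context, one instruction token with a fixed relevance score, and every other token sharing one lower score. Real models have many heads and layers, learned scores, and positional effects that this ignores, and the function name and numbers are invented for illustration. What it shows is the arithmetic of dilution: the instruction's score never changes, yet its share of the weight falls as the context fills.

```python
import math


def instruction_share(instruction_score: float, n_other_tokens: int,
                      other_score: float = 0.0) -> float:
    """Softmax weight on one instruction token competing with
    n_other_tokens tokens that all carry other_score (toy assumption)."""
    top = math.exp(instruction_score)
    return top / (top + n_other_tokens * math.exp(other_score))


# Same instruction, same score, progressively longer context.
for n in (300, 5_000, 50_000):
    print(n, round(instruction_share(5.0, n), 4))
```

With these assumed scores the instruction's share falls from roughly 0.33 at 300 competing tokens to roughly 0.003 at 50,000, although nothing about the instruction itself has changed.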

This is why we see the intent deficit most visibly in long, substantive knowledge work sessions rather than in short interactions. In a two-turn exchange – a question followed by an answer – the instruction and the generation are close enough that the instruction carries full weight. In a four-hour session producing a serious deliverable, the instructions established at session start are competing for attention with everything that has accumulated since. They lose the competition not because the model has ceased to understand them but because the mechanism by which they influence generation tends, in practice, to favour proximate content.

Together these two properties – pattern-completion pulling toward genre averages, attention attenuation weakening instructions that were not reinstated – produce the intent deficit as a predictable deployment-level phenomenon, one that appears specifically in the conditions of real knowledge work: sessions long enough that initial instructions attenuate, tasks substantive enough that generation is not trivial, stakes high enough that the gap between the genre average and the correct specific answer matters. The short, controlled conditions under which capability benchmarks measure model performance do not surface it, which is why a model that scores at the frontier on standardised benchmarks can simultaneously produce the full catalogue of failures above in deployment – the benchmark measures capability, while the deployment conditions are what expose whether that capability is being exercised.

Why prompt engineering does not close the gap

Prompt engineering has produced a real toolkit for specific quality dimensions: chain-of-thought prompting improves step-by-step reasoning on tasks that benefit from decomposition; few-shot examples calibrate format and register effectively; role conditioning – "you are a senior policy analyst writing for…" – meaningfully shifts model output; self-consistency and verification loops catch certain classes of error. These are all real contributions, and none of them is what we are contesting.

What we are contesting is that any of them addresses the class of failure we are describing, which operates structurally across long sessions rather than at a single turn. Prompt engineering techniques are one-time conditioning interventions at context entry, reshaping the model's generation at the moment the prompt is processed, but they do not architecturally ensure that reasoning persists across a full session of substantive work, because the mechanism they act through – placing constraints in context – is subject to the same attention-weight decay as any other context-level intervention.

The failure we are describing is therefore not one that any individual prompt can prevent. It emerges across a long chain of decisions, each locally plausible but collectively drifted from what the task actually needed. No prompt at session start prevents this drift, because better prompts produce better starting conditions but not the mechanisms for maintaining those conditions as the session fills.

This also explains why users who have worked extensively with language models often report that they have learned to get good results through some combination of skill, taste, and repeated correction – without being able to reduce what they have learned to a set of rules that a less experienced user could follow. What they have actually learned is a running discipline of identifying drift and intervening to correct it: noticing when the output is missing the audience, when a recommendation is buried, when a claim needs a verification step, when the conversion has dropped content. The discipline is continuous and distributed across the whole session. It is not encoded in any single prompt; it is encoded in the user's attention and reflexes. That discipline is exactly what the architecture described in the next post tries to encode structurally, so that it does not depend on the user having independently developed it.

The structural parallel to isomorphic mimicry

There is a broader reason to take this class of failure seriously, beyond the immediate effects on knowledge work output quality. The pattern – surface form adopted in the absence of the reasoning that would connect form to function – is not confined to AI; it is among the most persistent findings in the state capability research literature, where it has been documented and studied under the name isomorphic mimicry.

Andrews, Pritchett, and Woolcock, in their work on the capability trap, use isomorphic mimicry to describe a pattern in which governments adopt the institutional forms of capable states – the organisational charts, the designated procedures, the named reform policies – without developing the underlying function those forms require to work. The forms are present while the function they are supposed to enable is not, and in their analysis this is the characteristic failure mode of state capability development in contexts where technical knowledge is abundant but the judgement required to translate that knowledge into effective situated action is scarce.

The structural analogy to the session-level intent deficit is more than metaphor: both operate by pattern-completion on surface representations in the absence of reasoning from underlying function. A ministry that adopts a modern procurement procedure without the administrative judgement to apply it in context is producing the form of capable procurement without the function; a model that produces a policy brief with the expected structure without the underlying reasoning about why this specific audience needs this specific structure is doing the same thing at a different scale.

Whether these are the same mechanism operating at different scales, or structurally analogous mechanisms with a shared formal description, is a question the theoretical literature would have to resolve. What matters for practice is that the diagnostic is the same and the architectural response has the same shape: in both cases the surface form is cheap, the reasoning from function is the scarce thing, and adding more forms – more policy templates at institutional scale, better prompts at session scale – does not close the gap, because the gap is between form and the reasoning that would make the form do what it is supposed to do.

This framing matters for how we understand what AI does and does not contribute to knowledge work in institutional contexts. A default-deployed language model is exceptionally good at producing the surface forms of competent knowledge work. It is unreliable, in the specific ways the six failure modes describe, at supplying the reasoning from function that makes those forms actually work. An organisation that deploys AI tools without mechanisms for maintaining reasoning discipline is not getting a scalable substitute for experienced practitioners. It is getting form production at scale without the function. For contexts where experienced practitioners are abundant and AI is supplementary, this may be tolerable – the practitioners supply the function, and the AI accelerates the form. For contexts where experienced practitioners are scarce, the same deployment pattern risks reproducing isomorphic mimicry at industrial speed. The third post in this series develops this implication in detail.

What has to be done about it

The rest of this series describes a response – an architecture rather than a prompt or a productivity technique: a structured set of mechanisms for keeping reasoning active across the full length of a substantive knowledge work session, distributed across multiple levels of the context that governs model behaviour so that no single level bears the full load of maintaining discipline. The mechanisms include specific reasoning patterns that establish purpose and audience before generation begins, per-decision checkpoints that prevent the accumulation of defaults, persistent artifacts that survive context attenuation, domain references that calibrate to specific genre requirements, and mechanical checks at the workflow stages where reasoning alone has been empirically shown to fail.
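To give a rough sense of what a persistent artifact combined with a per-decision checkpoint could look like in practice, here is a minimal Python sketch. It is not the architecture the next post describes. The Brief class, its three fields, the checkpoint wording (adapted from the practitioner's test earlier in this post), and the step_prompt function are all hypothetical, chosen only to illustrate one idea: a statement of purpose, audience, and context that is stored once and restated next to each step of work, so that it is not left thousands of tokens behind.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Brief:
    """Hypothetical persistent artifact: what the deliverable is for, and for whom."""
    purpose: str
    audience: str
    context: str

    def render(self) -> str:
        return (
            f"Purpose: {self.purpose}\n"
            f"Audience: {self.audience}\n"
            f"Context: {self.context}"
        )


# Assumed wording, adapted from the practitioner's test described earlier.
CHECKPOINT = (
    "Before adding any element, state whether it earns its place for this "
    "purpose, audience, and context, or is present only as a default."
)


def step_prompt(brief: Brief, step: str) -> str:
    """Build the text for one work step with the brief and checkpoint placed next to it."""
    return f"{brief.render()}\n\n{CHECKPOINT}\n\nStep: {step}"


# Made-up example values.
brief = Brief(
    purpose="Support a go/no-go decision on phase two",
    audience="A senior decision-maker with minutes, not hours",
    context="Follows a frank internal review that the reader has not seen",
)
print(step_prompt(brief, "Draft the findings section"))
```

Restating a brief is still a context-level intervention, so on its own it inherits the limits discussed above. It is shown here only as one small component of the kind the paragraph lists, not as a substitute for the full set of mechanisms.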

The architecture operates at a level that practitioners working with general-purpose AI interfaces can deploy without developer access – loaded as a context document at session start, optionally extended with filesystem artifacts in environments that support them. We do not claim it is the only response possible to the intent deficit, or a complete response. Training-level interventions would address the underlying distribution pull in ways that no session-level architecture can. But session-level architecture is what is available to practitioners and institutions now, working with the models that actually exist, and what we have built is an attempt to make that level do as much as it can.

The next post describes the architecture itself, mechanism by mechanism, including the theoretical basis for each and the failure mode it addresses. The one after that turns to the capacity-building implications – what becomes possible when a scaffold designed for reasoning discipline is also designed to externalise reasoning in ways that transfer to the practitioners working alongside it, and what that might mean for developing-country and public-sector contexts where the shortage of experienced practitioners is the binding constraint.