Part 4 of a series on the structural problem of AI in knowledge work. Part 1 diagnosed the intent deficit; part 2 described a reasoning architecture for addressing it; part 3 drew out the capacity-building implications. This post shows both the failure modes and the architecture's corrections in concrete form, through six paired cases.
The previous posts in this series described, first, a structural pattern of failure in AI-assisted knowledge work – outputs that match the surface form of competent practice while missing the reasoning that would justify the specific choices for the specific purpose, audience, and context – and, second, a reasoning architecture designed to address that failure by keeping purpose, audience, and context active as generation proceeds. Both posts were general: they described classes of failure and classes of response at a level of abstraction that could cover many instances.
This post works at a different level. It presents six cases where the failure mode was observed in specific outputs, followed by what was produced when the same task was attempted with the reasoning architecture loaded and active. The cases are deliberately varied: some involve structural choices at the level of the whole document, some are about tone and register, some are about element-level design, and one is about a failure that no amount of reasoning catches and which requires a mechanical process gate. Each case illustrates a mechanism described in the architecture post, shown operating on a concrete task rather than described in the abstract.
A note on how the scaffold-active versions are described. The architecture runs the same process for every task: three theories of mind (the assigner, the ideal expert, the audience), followed by the output-as-means clarification, followed by derived decisions about content, structure, register, approach, and typology, followed by the framing check presented to the user, followed by execution under those derived decisions, with the audit protocols at delivery. What differs across cases is not the process but what the process produces for each specific task. To show this clearly, each positive-side narration below walks through the same steps in the same order, emphasising the step that did the load-bearing work for that specific case. The consistency across cases is what matters: the architecture is not a loose collection of helpful tricks that happen to catch different kinds of failure. It is a single structured process whose value comes from being run every time, and whose output varies because the substance of each task varies.
A note on the case material. The failures described below are not edge cases or adversarial constructions; they are the normal output of capable models operating on the normal kinds of substantive tasks practitioners encounter. They are also not exotic to any one institution or domain. A reader working in policy, research, communications, design, or any substantive knowledge-work field will likely recognise the patterns. Project-identifying detail has been abstracted out; what remains is the failure mode and the mechanism that catches it, which are the parts that matter.
A proposal was being drafted to a funder. Before asking for the proposal to be written, the session had included a frank internal discussion: the author's organisation was working in a space where a competing organisation had received substantially more funding and produced work the author felt was weaker. The discussion had been conversational and candid, in the register appropriate to an exchange between colleagues. That framing was reasonable as internal context. The proposal then needed to make a case for the author's organisation's funding that would necessarily involve drawing some contrast with what the funder had previously supported elsewhere.
The proposal reproduced the internal frame. The section positioning the author's organisation against its peers was phrased in language that implied the competitor had wasted donor funds – comparative claims with an edge, adjectival characterisations of the competitor's output as weaker, an overall posture of relative grievance rather than relative confidence. Read from the position of the funder, this did not read as "this organisation is confident in its work." It read as "this organisation is defined by its resentment of better-funded peers." A funder reading the proposal would form their view of the organisation not from its own work but from the aggression of the contrast. The proposal undermined its own case.
The mechanism behind the failure: instruction-latching combined with context inheritance. The internal frame saturated the most proximate context and carried through into generation. The model did not reason about the audience shift from "colleagues in a candid exchange" to "external funder deciding whether to engage further with us"; it treated the whole session as one context and produced output in the register of the most recent emotional frame. This is a structural property of how generation under attention works, not a model-specific error.
With the reasoning architecture loaded, the process ran in full before any drafting began.
The assigner profile was constructed at the level the scaffold requires: not the task description (draft the competitive-positioning section) but the underlying goal. What the assigner was actually trying to accomplish was not to highlight a contrast with a competitor; it was to secure this funder's support for the organisation's work, while establishing a relationship with the funder that would permit future rounds. The proposal was a means toward that goal, not the goal itself. A proposal that damaged the organisation's standing with the funder failed even if it was well-written.
The ideal-expert profile specified a senior development professional with experience in funder communications, whose quality signals would include the confident articulation of an organisation's own work without defining the organisation through its peers; restraint in comparative language; the recognition that funders read for signals of institutional maturity, and that an organisation positioning itself primarily through others' weaknesses signals the opposite of maturity. Errors the ideal expert would recognise on sight: any language that reads as grievance, any characterisation of competitors that a reader could repeat in a way that would embarrass the organisation, any tonal inheritance from internal conversation into external document.
The audience profile was constructed: a funder who had previously supported peer organisations, who had their own reasons for those earlier decisions, who was now evaluating whether to engage with this organisation, and who would form their impression not primarily from the substantive arguments the proposal made but from the character of the organisation that came through in how the proposal was written. A funder who encountered a proposal that characterised their previous grantees negatively would read that not as information about the grantees but as information about the proposer.
The output-as-means clarification followed from these profiles: the proposal is the means by which the funder's support is secured, and a proposal that positions the organisation through its peers' failures defeats the means regardless of what it says. That clarification reframes the task from "draft the competitive-positioning section" to "produce a section that makes the funder confident in this organisation's capacity to do the work."
The derived decisions fell out of this: competitive framing removed as a structural device; differentiation made through the substance of what the organisation had achieved and what the funder's support would enable; comparative language restricted to factual statements about scope and approach, not evaluative claims about competitor quality; tone calibrated to confidence rather than contrast; the persuasive-communications reference specifically invoked for its guidance on differentiating through substance rather than through characterising others negatively.
The framing document presented these decisions to the user for confirmation: the section was being treated as a confidence-building artifact rather than a competitive statement, the positioning was being made through the organisation's own substance, and if the intent was actually to critique the funder's prior choices, the approach would need to change. The user confirmed the framing as stated.
Execution then produced a positioning section in which the organisation's standing was established by what it had done and what it was positioned to do, not by what peers had failed to do. The funder, reading the finished proposal, would encounter an organisation confident enough in its own work that it did not need to define itself through its peers.
The move that made the difference was the assigner-at-goal-level step, which forced the reframe from output-as-end (the section) to output-as-means (the funder's support). Everything downstream – the derived decisions, the language calibration, the invocation of the persuasive-communications reference – followed from that reframe. A model that skips this step and accepts the task description as the goal will produce a section calibrated to how such sections look. A model that performs this step produces a section calibrated to what such sections are for. The same architecture would run the same steps for any persuasive-communications task; what would change is what the specific profiles produced, and what the specific derived decisions concluded.
A substantive document was being produced for an audience of senior officials. The instructions given to the model were thoughtful and specific: be thorough rather than surface-level; use prose and paragraphs rather than bullet points; explain the relationships between ideas; give context sufficient for the document to stand on its own. These were good instructions for the register being asked for. The author provided two or three existing documents as stylistic examples, and asked for in-depth analytical development rather than a summary.
The document that came back was in every respect what was instructed. It was thorough. It was in prose. The relationships between ideas were articulated carefully. Context was provided. What it also was: structurally impenetrable to the audience for whom it was intended. The operational content – the specific proposals being advanced, the decisions being recommended, the findings being presented for action – was distributed through paragraphs of argument on pages four and five, reachable only by sustained close reading. An official who received this document and gave it the typical first-encounter attention a busy reader gives a new document could not have identified what was being proposed at a scan. They would have set it aside, likely without reading further.
The mechanism of the failure: the model had latched onto the nearest-stated instruction (be thorough, use prose, explain) and treated it as the operative goal. In doing so, it lost track of what the instruction was for – a document that a senior decision-maker could actually engage with given their real working conditions. The instruction was followed. The goal was abandoned.
The process ran fully, with the audience step as the load-bearing move.
The assigner profile identified the goal: the document existed to move the recipient in the direction the evidence warranted – implementation of the proposals. Not "produce a document that demonstrates thorough analysis," which was how the task description read. The document was a means by which decision and action would occur; a document that did not reach decision was a failed document regardless of how thorough its prose.
The ideal-expert profile specified a senior policy communications professional whose experience included producing analytical work for time-pressed government readerships. Quality signals would include: operational content identifiable at first scan; argument organised so a reader could enter at multiple points; analytical depth preserved for the reader who wanted it but not at the cost of the scannable layer.
The audience profile was constructed at the level the scaffold requires for multi-layered documents. The primary audience was not a single composite reader; it was two distinct readers interacting with the same document under different conditions. The senior decision-maker was modelled specifically: receiving many documents per week, giving each first-encounter attention of under a minute, committing to a full read only if that first encounter signalled the document was worth their time, with the engagement decision resting on whether they could identify what was being proposed at a scan. The junior staffer to whom the senior might delegate implementation was modelled as a second audience: a careful reader who would work through the document in full, who needed the analytical depth and the implementation-level specificity, who would use the document as a working reference through the period of implementation. The tension between the two audiences was noted explicitly: a document optimised only for the senior scanner would be too thin for the junior implementer; a document optimised only for the junior implementer would bury the point the senior needed to find.
The output-as-means clarification: the document is the means by which the senior decision-maker makes a decision to implement, and the means by which the junior implementer executes. Both functions must be served for the document to succeed. A document that serves one and fails the other is a failed document.
The derived decisions followed from the layered audience: operational content front-loaded in a form the senior scanner could identify in the first thirty seconds; analytical argument organised under headings that made the structure navigable at scan; implementation-level specificity preserved in the sections the junior reader would work through; the detailed-prose instruction retained where it served the argument but not at the cost of the document's legibility to the primary reader. The structure was designed for the layering rather than for a composite reader.
The framing document presented this to the user: the document is being structured to serve both a senior scanner and a junior implementer, with the layering deliberate; if the primary audience is actually only the senior reader, the implementation sections will be compressed; if the primary audience is actually the implementer, the front-loaded scan layer will be compressed. The user confirmed the dual-audience reading.
Execution produced a document whose operational content was identifiable in the first thirty seconds of reading, whose analytical argument was available for the reader who wanted it, and whose implementation sections had the specificity the downstream reader required. The thorough-prose instruction was honoured where it did work; it was overridden where it would have buried what the primary audience needed to find.
The move that made the difference was the multi-layered audience model – the recognition that a single document often serves multiple readers with different working conditions, and that the structure has to be designed for the layering rather than collapsed into a composite reader who is actually no one. This is a structural audience-modelling move that the scaffold forces because its audience profile explicitly requires resolving tensions between multiple audiences rather than averaging them. A composite "senior officials" audience would have produced something closer to the default output; a layered audience model produced a document where both readers encounter what they need at the depth they need it.
A set of interactive learning tools was being built for primary-school students encountering numeracy concepts for the first time. The pedagogical approach had been specified by the author: the concepts were number lines, decomposition, magnitude comparison, place value; the pedagogical method was articulated (what each tool should teach, how the interaction should unfold, what the learner should experience at each step of the visual transformation); the learning outcomes were laid out. The model's task was not to design the pedagogy – that had been done – but to build the interactive tools that would implement it.
The tools ran. They looked clean. They covered the specified concepts. Tested by someone who already knew what the tools were supposed to teach, they would have seemed to work. A child encountering them for the first time would have received, through the tools' specific interface decisions, a cluster of subtle miseducations.
A decomposition bar diagram was built with all bars at equal width regardless of the magnitudes they represented – a clean visual grid, proportional spacing, good use of colour. For a learner who already understood that quantity corresponds to bar length, the equal widths were easily set aside as a rendering choice. For a six-year-old encountering decomposition for the first time, the equal widths were the visual statement the tool was making: quantity does not correspond to size. The tool, at the implementation layer, taught the opposite of the concept the pedagogy had specified.
A number-line visualisation began with both numbers already placed on the line. The pedagogy had specified a dynamic transformation: the learner places the first number, then the second, then sees the relationship. The implementation produced a static display where both numbers were already there.
A slider was used for an input the pedagogy had specified as discrete (integer choices only). The slider implied continuous variation, suggesting that intermediate non-integer values were available, where the concept being taught required that they were not.
A dark-mode toggle was implemented by programmatic colour inversion. The pedagogy had used food analogies (red apples, yellow bananas) for their cognitive availability to young children. Under inversion, the apples became cyan and the bananas became blue. The analogy's functional role – a food the child could identify – was destroyed at the implementation layer, turning the pedagogical device nonsensical in one of the modes the tool was meant to support.
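To make the last failure concrete: channel inversion is correct as a colour operation and wrong as a semantic one. A minimal sketch, assuming standard RGB values for the fruit colours (the project's actual palette is abstracted out):

```python
# Naive dark-mode inversion: flip each RGB channel.
# A sketch of the failure mode, not the project's actual code.
def invert(rgb):
    r, g, b = rgb
    return (255 - r, 255 - g, 255 - b)

print(invert((255, 0, 0)))    # apple red     -> (0, 255, 255): cyan
print(invert((255, 255, 0)))  # banana yellow -> (0, 0, 255):   blue
```

Nothing in the channel arithmetic knows that the red was doing identification work; the inversion preserves contrast and destroys meaning.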
Each of these is a failure at the implementation layer, not at the pedagogical layer. The pedagogy had been correctly received, and the tools technically implemented it; but the specific interface decisions broke the pedagogy's functional requirements in ways that were invisible in code review and became visible only by rendering the tools and watching them from the position of a first-encounter child learner – which default generation did not do. The deeper failure was the absence of a specific user in view. The default implementation targeted an abstract "learner" who was implicitly imagined as already knowing the things the tool was supposed to teach.
The process ran with the audience profile doing nested-audience work.
The assigner profile was constructed at the goal level: the tools existed so that children who had not previously encountered these concepts would encounter them through the tool and leave with a working grasp of them. The tools were a means toward that pedagogical outcome. A tool that technically covered the concept but miscommunicated it through its interface layer was a failed tool.
The ideal-expert profile specified a developer or designer experienced in building educational interfaces for early-primary learners, whose quality signals would include: every visual property of the interface is either doing pedagogical work or is being actively controlled so that it does not distract from the pedagogical point; transformations the pedagogy specifies as dynamic are implemented dynamically; discrete inputs use discrete interface elements; mode changes preserve the functional semantics of any visual analogies; element sizes are calibrated to the attention patterns of the intended age group. Errors the ideal expert would recognise on sight: equal-width bars for unequal quantities; static displays of transformations the pedagogy specified as dynamic; continuous controls for discrete inputs; any mode change that turned a visual analogy into nonsense.
The audience profile was constructed with explicit nesting. The direct audience was the teacher or programme coordinator who would evaluate the tool for classroom deployment. But the evaluator's quality signal was not "does the tool look good"; their quality signal was "will a child encountering this learn what the tool is trying to teach." To serve the evaluator, the tool had to serve the end user the evaluator was deciding on behalf of. The end-user profile was therefore constructed as the primary audience model: a six-year-old encountering decomposition or magnitude for the first time, whose attention is drawn by size, colour, and motion, whose filter for which visual properties are incidental and which carry meaning has not yet been built, who reads the visual statement the tool makes about quantity as the statement about quantity the tool is making.
The derived decisions fell out in the form of an element typology specific to this project. Bars in magnitude visualisations: proportional to magnitude, always. Number-line transformations: dynamic, with the learner placing each value and seeing the relationship emerge from the placement. Interactive inputs: discrete inputs use discrete UI elements (steppers, buttons, selection grids), continuous inputs use sliders. Colour treatment across modes: colour choices checked against their functional role before any inversion or shift rule is applied. Layout: element sizes calibrated to legibility and attention for the age group.
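A typology of this kind is most useful when it is encoded as something checkable rather than remembered. A hypothetical sketch – the names and rule structure here are illustrative, not taken from the project:

```python
# Hypothetical encoding of part of the element typology as checkable rules.
ALLOWED_WIDGETS = {
    "discrete":   {"stepper", "buttons", "selection_grid"},
    "continuous": {"slider"},
}

def widget_conforms(input_kind: str, widget: str) -> bool:
    """A slider on an integer-only input fails this check."""
    return widget in ALLOWED_WIDGETS[input_kind]

assert not widget_conforms("discrete", "slider")  # the failure described earlier
assert widget_conforms("discrete", "stepper")     # a conforming choice
```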
The framing document presented the primary versus secondary audience split explicitly: the tool is being designed for the child learner as the primary audience, with the teacher-evaluator as a secondary audience whose quality signal depends on child-learner outcomes. The user confirmed the child-primary framing.
Execution then produced tools whose interface decisions were each specifically serving the pedagogical concept. The visual QA protocol caught the remaining implementation-level issues at render time.
Design work suffers acutely from audience-agnostic defaults because the gap between a design that serves a specific user and a design that looks reasonable in the abstract is invisible at the code or markup level. It becomes visible only when the design is evaluated against a specific user who does not yet know the things the design is trying to teach or show. The scaffold's nested-audience pattern – modelling not only the immediate evaluator but the end user the evaluator is deciding on behalf of – is what catches these failures at design time. Notably, the pedagogical approach itself was specified by the author and did not need to be generated by the model; the failure and the correction both operated at the implementation layer, where the scaffold's audience modelling determined which interface decisions would actually deliver the pedagogy the author had designed.
A policy brief was being developed on a sensitive topic – a legal and social domain where a documented pattern of harm was being analysed and recommendations for legislative and procedural change were being developed. The session, before arriving at the specific analysis that eventually failed, had spent extensive time engaging with case studies and interview transcripts illustrating the pattern of harm under examination. For each case, the analysis had been careful and rigorous. The cases shared a structural pattern: harm of a specific kind, falling on a specific identifiable group.
Then, within the same session, a new case study was introduced. The structural pattern was similar – a disputed situation involving the same broad category of harm – but the demographic profile of the case was different: it did not fit the victim pattern of the preceding cases, and the roles were not straightforwardly distributed the way those cases had distributed them.
The analysis of the new case imported interpretive frames from the prior cases without the evidential basis for doing so. Suspicion fell in the same direction it had fallen in the preceding cases, by pattern-match rather than by evidence. Categorical assumptions that had been appropriate to the preceding case-by-case analyses (given their shared structural properties) were silently applied to a case whose structural properties did not warrant them. The analysis was confident-toned, carefully worded, and substantively wrong about what the new case's evidence actually showed. Had the brief been submitted to its intended audience – legal reviewers and legislators whose evaluation standards are especially stringent on gender-sensitive material – a reviewer would have flagged the methodological failure on first reading. The brief could have been dismissed as advocacy targeted at a specific demographic, collapsing the argument regardless of the quality of its recommendations. Ironclad neutrality was not a stylistic preference for this brief; it was a load-bearing requirement for the brief to function.
This case illustrates the scaffold process operating across a longer session, where the framing produced at session start must be re-instantiated at mid-session decision points rather than trusted to persist automatically.
The assigner profile at session start identified the goal: legislative and procedural change to address the documented pattern of harm. The brief was the means by which decision-makers would be moved to enact change. Anything that caused the brief to be rejected by those decision-makers defeated the means.
The ideal-expert profile specified a legal and policy researcher with experience producing evidence-based reform proposals on sensitive topics for legislative audiences. Quality signals would include: analytical neutrality as a load-bearing requirement, not a stylistic register; each case analysed on its own evidence, without importing interpretive frames from structurally different cases; caveats explicit where the evidence warrants caveats; conclusions proportional to the evidence supporting them. Errors the ideal expert would recognise: any interpretive move not grounded in the specific case's evidence, any category-level assumption that a reviewer could test and find unsupported.
The audience profile was constructed with specific attention to the rejection thresholds: legal reviewers and legislators who evaluate policy proposals against standards of evidentiary rigour and methodological neutrality that are especially stringent for gender-sensitive material. The profile named the specific way the brief could fail: if the brief could be characterised as advocacy targeted at a specific demographic rather than as analysis of a pattern of harm, the argument collapses regardless of the quality of its recommendations.
The derived decisions followed: each case analysed on its own evidence; no interpretive frames imported from structurally different cases without evidential warrant; caveats explicit where the evidence warrants them; conclusions proportional to evidence; the brief's methodology specifically designed to survive hostile review.
The mechanism that did the load-bearing work for this case then operated mid-session. When the new case study was introduced, the scaffold required consulting the framing document before the analytical decision was made. The mid-session re-read reinstated the audience model at full weight. The question the model asked at that point was not "how does this case fit the pattern?" but the question the framing mandated: "what would this audience require of the analysis of this specific case, and what would cause them to reject it regardless of its recommendations?" Asked that question, the answer was clear: a case with different structural properties requires analysis on its own evidence; importing interpretive frames from the preceding cases would be the methodological failure the framing had specifically identified as disqualifying. The analysis of the new case was produced on its own evidence, with interpretive frames constrained to what the case's specific evidence warranted.
Attention attenuation is the mechanism by which earlier framings lose influence over later decisions. The corrective is not better initial instructions – those will attenuate the same way – but persistent artifacts that can be re-read at decision points to restore the framing to full weight. The architecture's instruction to consult the project understanding before major decisions is specifically designed for failures of this class: failures that arise mid-session, after proximate context has crowded out the earlier framing, and which require the framing to be actively reinstated rather than passively trusted to still be in force. The three theories of mind, the derived decisions, and the framing document together are the artifact that gets re-read. Their value at session hour four depends on their actually being consulted when they are needed; the scaffold makes that consultation a required step rather than an optional one.
An institutional training deck was being prepared for an audience of subject-matter experts in the field the training concerned. The participants would be senior professionals with extensive prior knowledge of the domain – the frameworks, the legal or regulatory context, the standard practices – who were being trained on a specific programme's structure, schedule, and operational mechanics. The audience needed granular detail on the programme itself; they did not need foundational explanations of the domain within which the programme operated, because they lived in that domain every day.
The deck was structured around a long scene-setting introduction: several slides explaining the concept the training addressed, its definitions, its variants, the legal context, the broader challenges the domain faced. This is the structure an institutional-training deck defaults to when produced by a model that has learned what institutional-training decks look like across a training distribution that skews toward audiences who need to be brought up to speed. For this specific expert audience, the foundational content would read as condescending at best and as wasting their time at worst.
The slides were also visually busy in the way default institutional-training decks tend to be: icon grids, card layouts, stat callouts, parallel text structures across slides. Each individual design choice was familiar from presentation conventions. The deck's overall visual language read as generic institutional content rather than as a working document produced for a specific professional audience.
The process ran with the audience profile producing substantial content-removal decisions before any drafting.
The assigner profile identified the goal: the participants must leave the training equipped to carry out the programme. The deck was the means by which the programme's operational content would be transmitted. A deck that used most of its time explaining foundations the audience already knew would transmit less of the programme's operational content than the audience needed, and would also signal that the organisation delivering the training had not thought carefully about who the audience actually was – an institutional credibility failure regardless of the substantive training that came afterward.
The audience profile was constructed with specific attention to what the audience already knew. The participants were identified as subject-matter experts who already understood the field's frameworks, its legal or regulatory context, its standard practices, and the challenges it faced. What they did not know, and what they were attending the training to learn, was: the specific programme's structure in operational detail; the day-by-day schedule with times and topics; the mechanics of certification or completion; the experiential practice methodology as it would actually be conducted. The profile explicitly named the audience's quality signal: a deck that explained to them what their own professional domain was would signal that the organisation had not understood who it was speaking to, which is an institutional credibility failure.
The output-as-means clarification: the deck is the means by which the operational content of the programme is transmitted to the audience. Every minute of deck time spent on foundations the audience already knows is a minute not spent on operational content; the output-as-means framing forces this trade-off to be seen explicitly rather than hidden under "the deck has the standard sections."
The derived decisions fell out directly from the audience profile: the foundational content was removed. Not compressed, not relocated – removed. Slides explaining the domain's frameworks, legal context, and standard practices were cut. The space that had been allocated to them was reallocated to detailed, day-by-day breakdowns of the programme's structure; specific agendas with times and topics; operational mechanics of certification; the experiential practice methodology as it would actually be conducted. The deck got shorter in total length because of the removal and longer in operational content because of the reallocation.
The element typology followed from the same audience model. Card-based layouts that read as templated were replaced with structural treatments that matched the density of the specific content – clean tables for agenda information with time slots and session names, varied layouts across slides rather than repetitive grids. Icon decoration was removed where it was not doing functional work. Colour palette was restrained: a primary institutional colour used for section anchors, a secondary colour for accents used sparingly. Language was calibrated to the register of a working programme document: direct, operational, specific, without marketing-inflected framings like "ensuring participants understand" or "empowering professionals to."
Execution produced a deck that opened with the programme overview, moved directly into the day-by-day operational content, spent its time on what the audience was actually there to learn, and used design language that matched the institutional context rather than the marketing context default generation had produced.
Audience theory-of-mind is not fundamentally about tone. It is about structural decisions – what content to include, what content to exclude, how to arrange what remains, what register the deliverable should operate in – that follow from a specific understanding of what the audience knows, needs, and brings to the encounter. The most consequential effect of running the audience check for this case was not that the language was adjusted; it was that roughly a third of the original content was cut because the audience already knew it. This is the kind of decision that is difficult to make from the position of default generation, because default generation's priors are drawn from a training distribution in which audiences more often need the foundational content than not. Audience modelling set correctly reverses the prior where the specific audience requires it.
A deck was being generated programmatically – slides produced by code rather than by direct visual editing. Each slide was specified in the code with positional values for its elements: the table starts at vertical position 1.3 inches from the top, each row is 0.42 inches tall, the deck has ten rows on this slide, the footer sits at vertical position 5.2 inches. Numbers in source code. The generation ran, produced the deck as a PPTX file, and the file was ready to deliver.
The table, at ten rows of 0.42 inches starting from 1.3 inches, ends at 5.5 inches (1.3 + 10 × 0.42). The footer sits at 5.2 inches. 5.5 is greater than 5.2. The last row of the table overlaps the footer. Every element of this calculation is in the source code; none of the numbers are hidden. A careful code review could catch it. In practice, code-level review of a deck with dozens of slides, each with its own positional parameters, does not catch dimensional failures of this class reliably. The reviewer is looking at code, thinking about code-level properties, and the specific interaction between three numbers on three different lines is not salient at the level where the review is happening.
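The interaction, isolated from the surrounding code, is trivial – a sketch using the case's values, in inches:

```python
# Three numbers from three different lines of the generation code.
table_top, row_count, row_height = 1.3, 10, 0.42
footer_top = 5.2

table_end = table_top + row_count * row_height  # 1.3 + 4.2 = 5.5
print(table_end > footer_top)                   # True: the last row overlaps the footer
```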
This case is structurally different from the preceding five, because the failure is not one that reasoning catches regardless of how well the process runs. The three theories of mind were constructed; the output-as-means clarification was made; the derived decisions produced the element typology, including layout and positional choices. All of this was done correctly. None of it caught the overlap.
The overlap was caught by the mechanical process gate that runs regardless of how well the reasoning has run: visual QA.
The architecture's visual QA protocol requires that any rendered visual deliverable be produced as images – the PPTX converted to PDF, the PDF rasterised to per-page PNGs at sufficient resolution – and that the images be actively inspected before the deliverable is considered complete. The inspection is not a glance; it is an active pass looking for specific failure modes from a catalogue: text clipping at containers or boundaries, overlapping elements, inconsistent margins across pages, orphaned headings separated from their content. On this deck, the footer overlap was visible immediately when the slide was rendered and inspected, because the overlap, which is invisible in three lines of code, is unmistakable as an image.
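One plausible shape for the render step, assuming LibreOffice and poppler-utils are available on the machine – the protocol does not mandate a particular toolchain, so this is a sketch of the gate rather than the implementation:

```python
import subprocess
from pathlib import Path

def rasterise_deck(pptx: str, outdir: str = "qa_render", dpi: int = 150) -> list[Path]:
    """Render a PPTX to per-page PNGs for visual inspection."""
    out = Path(outdir)
    out.mkdir(exist_ok=True)
    # PPTX -> PDF via headless LibreOffice
    subprocess.run(["soffice", "--headless", "--convert-to", "pdf",
                    "--outdir", str(out), pptx], check=True)
    pdf = out / (Path(pptx).stem + ".pdf")
    # PDF -> per-page PNGs at inspection resolution
    subprocess.run(["pdftoppm", "-png", "-r", str(dpi), str(pdf), str(out / "page")],
                   check=True)
    return sorted(out.glob("page-*.png"))
```

The inspection pass then runs over the returned images against the failure catalogue.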
The failure catalogue at this point produced the specific diagnostic check. Rather than re-running the generation and hoping the overlap was gone, the protocol computed it: ten rows at 0.42 inches is 4.2 inches; starting from 1.3 inches, this ends at 5.5; the footer is at 5.2; 5.5 is greater than 5.2; the overlap is 0.3 inches. The corrective computation: reducing row height to 0.38 inches gives ten rows at 3.8 inches, ending at 5.1 inches, which clears the footer at 5.2. A surgical fix was applied (row height reduced from 0.42 to 0.38 for the specific slide) rather than a rebuild. The fixed version was re-rendered, re-inspected, and confirmed clean before the audit moved on.
Some failure modes are not reliably caught by reasoning, no matter how active the reasoning is or how well the theories of mind are constructed. A careful reviewer running the earn-its-place test on each element of a slide would not necessarily do the vertical-position calculation that would have caught this overlap. The calculation is cheap but specific; running it requires noticing that it is the calculation to run. Visual QA as a required process gate makes the failure visible by shifting the mode of inspection from code to image. The failure catalogue then provides the specific check: the overlap becomes not a visual judgement but a dimensional calculation that either clears or does not clear, with arithmetic that either resolves or does not resolve.
This is also the case that most clearly illustrates why a failure catalogue's value scales with specificity. A catalogue that said "check layouts carefully" would not have produced this correction. A catalogue that said "check for footer overlap on content-dense slides" would have directed attention to the right place but would still have required visual judgement. A catalogue that says "for content-dense slides with tabular layouts, compute table-end position as starting position plus (row count × row height), and verify that this position clears the footer position" produces a check the model can run mechanically. The catalogue entry encodes not only what to check for but how to check for it. This specificity is what converts a skill file from a generic framework into an institutional asset, and the mechanism by which institutions that accumulate specificity in their failure catalogues will see outsized quality improvements relative to institutions that rely on generic guidance.
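That last catalogue entry, encoded as the mechanical check it describes – function and parameter names are illustrative:

```python
def table_clears_footer(table_top, row_count, row_height, footer_top):
    """Table-end position = table_top + row_count * row_height; it must clear the footer."""
    return table_top + row_count * row_height <= footer_top

assert not table_clears_footer(1.3, 10, 0.42, 5.2)  # the failing deck: ends at 5.5
assert table_clears_footer(1.3, 10, 0.38, 5.2)      # the surgical fix: ends at 5.1
```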
Six cases are a small sample, but the pattern they illustrate is consistent across a much larger body of substantive knowledge-work output. Three features of that pattern deserve explicit naming.
The failure modes are not exotic. Every case above describes the normal behaviour of capable models operating on the normal kinds of substantive tasks practitioners perform. No case required an adversarial prompt or an edge-case input. Case 1 happened because an internal discussion preceded an external-facing task; Case 2 because thoughtful instructions were followed literally; Case 5 because the model produced the deck its training distribution told it institutional-training decks typically look like. These are not errors that better training data alone will eliminate, because the training distribution contains the patterns the model is reproducing – the patterns are the problem, not the training's failure to capture them.
The correctives are not more instructions. In every case, what caught the failure was a structural change to what the session required, not a more carefully written prompt. The three theories of mind, the output-as-means clarification, the derived decisions, the framing check, and the audit protocols at delivery are each required steps rather than optional instructions. The scaffold's mechanism is not better prompt wording; it is making specific reasoning steps into required gates that generation does not proceed past.
The process was the same in every case. This is the point the paired structure of the cases was designed to show. In every case, the same steps ran: three profiles, output-as-means, derived decisions, framing document presented to the user, execution, audit gates at delivery. What differed across cases was what the process produced for each specific task – for Case 1, an assigner profile that reframed the competitive section away from grievance; for Case 2, an audience profile that identified two distinct readers and forced layered structure; for Case 3, a nested audience profile where the direct audience's quality signal depended on the end user's experience; for Case 4, a framing document that mattered more when re-read at a mid-session decision point than when produced at session start; for Case 5, an audience profile that drove substantial content removal because the audience already knew the material; for Case 6, an audit gate that caught a mechanical failure reasoning could not catch. The architecture does not work by containing different mechanisms for different cases; it works by being a single structured process, run consistently, whose output varies because the substance of the task varies.
The broader implication – that architectures of this kind could be designed not only for output quality but for the transfer of professional judgement to practitioners working alongside them – was developed in the third post in this series. What the cases above add is what the third post could only argue theoretically: that the mechanisms described work on specific failures in specific ways, that the improvements they produce are both real and describable, and that the consistency with which the process runs across cases is what makes the architecture reliable rather than clever.