Framework v0.1

Sources & Attribution I

Intellectual lineage of the failure mode taxonomy.

Appendix — Research Grounding Framework

Part I: Failure Mode Lineage

Epistemic tier key (revised):

  • T1 — Primary empirical: original study with direct measurement
  • T2 — Secondary synthesis: claim derived across multiple T1 sources through reasonable extrapolation; synthesis move is explicit and traceable
  • T3 — Practitioner/domain observation: expert opinion, case study, domain observation
  • T4 — reviewed — Novel framing, explicitly human-reviewed
  • T4 — provisional — Novel framing, not yet reviewed; should not appear in main document without acknowledgment
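The tier key above can be read as a small data structure plus one rule: a provisional T4 claim needs an explicit acknowledgment before it appears in the main document. The following is a minimal sketch of that reading, not part of the framework itself. The names `Tier`, `Claim`, `acknowledged`, and `allowed_in_main_document` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    """Epistemic tiers, with labels paraphrased from the tier key."""
    T1 = "primary empirical"
    T2 = "secondary synthesis"
    T3 = "practitioner/domain observation"
    T4_REVIEWED = "novel framing, human-reviewed"
    T4_PROVISIONAL = "novel framing, not yet reviewed"


@dataclass(frozen=True)
class Claim:
    text: str
    tier: Tier
    # Hypothetical flag: True when the text openly states the provisional status.
    acknowledged: bool = False


def allowed_in_main_document(claim: Claim) -> bool:
    """Provisional T4 claims pass only when acknowledged; other tiers always pass."""
    if claim.tier is Tier.T4_PROVISIONAL:
        return claim.acknowledged
    return True


claims = [
    Claim("LLMs generate factually false content with surface fluency.", Tier.T1),
    Claim("Credentialing Drift is a novel named phenomenon.", Tier.T4_PROVISIONAL),
]
# Collect the claims that would need an acknowledgment before inclusion.
blocked = [c.text for c in claims if not allowed_in_main_document(c)]
```

Keeping the tier on the claim object, rather than in surrounding prose, means the acknowledgment rule can be checked mechanically each time the document is assembled.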

FM-1: Fabrication


Claim: LLMs generate factually false content with surface fluency, and do so with systematic frequency in research contexts. [T1]

The existence of hallucination as a named failure mode is well-established across multiple independent research programs. The most directly relevant empirical grounding in this corpus is the GPTZero finding that approximately 17% of a sample of early ICLR 2026 submissions contained at least one hallucinated citation — a finding about author behavior in research contexts specifically. This figure is illustrative rather than foundational; hallucination rates vary substantially by domain, task, and model generation and should not be cited as a stable baseline.

Sources: GPTZero/ICLR 2026 submission analysis (confirmed with correction — see validation log Entry 3). General hallucination literature is extensive; no single source is load-bearing for this claim.


Claim: Fabrication is structurally distinct from other failure modes — it introduces content with no basis in any source, rather than distorting existing content. [T2]

The hallucination literature does not draw this distinction explicitly — most taxonomies treat fabrication, confabulation, and attribution drift under a single umbrella. The framework separates FM-1 (fabrication: content invented wholesale) from FM-2 (attribution drift: real content misassigned) and FM-4 (synthesis validity: real content combined invalidly) because the intervention profiles differ. This is a synthesis move across the hallucination taxonomy literature: the structural distinction is derivable from existing typologies but is not stated in this form in any single source.

Sources: General hallucination and confabulation literature. Synthesis move is framework-level; the inference that distinct failure types warrant distinct interventions is traceable to the TRIZ structural borrowing logic documented in the TRIZ framework document.


FM-2: Attribution Drift


Claim: LLMs systematically misattribute findings, quotes, and claims to incorrect sources, and do so in ways that survive surface plausibility checks. [T1]

Confirmed empirically within this corpus. The validation log documents a concrete instance: synthesis documents consistently cited “Bogaert et al.” for a finding that belongs to van den Akker et al. (2024) — an attribution error that propagated across multiple documents without triggering any internal check. This is not an isolated transcription error but a documented instance of the failure mode operating on the framework’s own corpus, which gives it particular evidential weight: the failure was demonstrated in the process of building the framework designed to catch it.

Sources: van den Akker et al. (2024), Behavior Research Methods, DOI: 10.3758/s13428-023-02277-0. Validation log Entry 1. The propagation pattern is documented across corpus documents.


Claim: Attribution drift is structurally related to but distinct from fabrication — the content is real, but the provenance chain is broken. [T2]

Derivable from the documented van den Akker/Bogaert case: the finding was accurately represented while the attribution was wrong — the content passed a plausibility check that a provenance check would have caught. The distinction between content accuracy and provenance accuracy is implicit in source-checking methodology and explicit in citation verification practice; the framework names it and makes it the basis for a separate failure mode with a distinct intervention profile (GI-10 Decompose and Verify rather than GI-8 Ontology Grounding).

Sources: Validation log Entry 1 for the documented instance. The structural distinction is a synthesis inference across citation verification methodology and the hallucination literature.


Claim: Attribution drift compounds via synthesis — each generation of secondary citation increases the probability of further drift. [T2]

The ICLR 20% conflation (validation log Entry 3) is a documented single-generation case: two distinct findings were merged into one composite statistic in one synthesis pass. The compounding claim extends this to multiple generations, which is a reasonable extrapolation from information-theoretic principles and the documented single-generation case, but has not been tested longitudinally in this corpus or elsewhere in the cited literature.

Sources: Validation log Entry 3 for the single-generation case. The compounding inference is T2 — reasonable extrapolation from documented evidence, but requiring longitudinal validation to upgrade to T1.
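To make the compounding inference concrete, the sketch below assumes that each synthesis generation introduces drift independently with the same probability. That independence assumption is not established by the corpus or the cited literature; it is the simplest model under which the claim can be stated numerically. The function name and the example value of 0.1 are hypothetical and are not measured rates.

```python
def cumulative_drift_probability(p_per_generation: float, generations: int) -> float:
    """Probability of at least one drift event after n generations.

    Assumes independent, identically likely drift per generation:
    P(at least one) = 1 - (1 - p) ** n.
    """
    if not 0.0 <= p_per_generation <= 1.0:
        raise ValueError("p_per_generation must lie in [0, 1]")
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return 1.0 - (1.0 - p_per_generation) ** generations


# Illustrative only: 0.1 is a placeholder, not a rate reported anywhere in the corpus.
for n in range(1, 6):
    print(n, round(cumulative_drift_probability(0.1, n), 4))
```

Under these assumptions the cumulative probability rises with every generation for any nonzero per-generation rate. Whether real citation chains behave this way is exactly what the longitudinal validation would need to test.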


FM-3: Absence of Disconfirmation


Claim: LLMs systematically fail to surface evidence that contradicts the user’s implicit or explicit hypothesis, producing outputs that are confirmatory by default. [T1]

Batista & Griffiths (2026) provide direct empirical grounding. In a modified Wason 2-4-6 rule discovery task (N=557), default unmodified GPT behavior suppressed rule discovery and inflated confidence comparably to explicitly sycophantic prompting, while unbiased sampling yielded discovery rates approximately five times higher (29.5% vs. 5.9%, equivalence-tested). The mechanism is formally characterized: a sycophantic AI samples from the distribution implied by the user’s hypothesis rather than from the true distribution, producing data that feels like evidence but carries no diagnostic value.

Sources: Batista & Griffiths (2026), arXiv:2602.14270v1. Preprint caveat applies. Model tested: GPT-5.1-Chat.
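The sampling mechanism can be illustrated with a toy Bayesian learner. This is a sketch under stated assumptions, not the model or the task implementation used by Batista & Griffiths. The number range, the two candidate rules, the 50/50 prior, and the likelihood (each triple is treated as drawn uniformly from the rule's allowed triples) are all choices made here for illustration, and the sketch does not attempt to reproduce the reported discovery rates.

```python
import random
from itertools import combinations

NUMBERS = range(1, 21)  # toy domain; an assumption of this sketch

# Each hypothesis is represented by the set of triples it allows.
BROAD = set(combinations(NUMBERS, 3))               # true rule: any ascending triple
NARROW = {(a, a + 2, a + 4) for a in range(1, 17)}  # user's guess: "goes up by 2"


def posterior_true_rule(triples, prior_true=0.5):
    """Posterior probability of the true (broad) rule after observing triples.

    The learner treats each triple as a uniform draw from the allowed set of
    whichever rule is correct, so a triple outside a rule gives it likelihood 0.
    """
    w_true, w_user = prior_true, 1.0 - prior_true
    for t in triples:
        w_true *= 1.0 / len(BROAD) if t in BROAD else 0.0
        w_user *= 1.0 / len(NARROW) if t in NARROW else 0.0
    total = w_true + w_user
    if total == 0.0:
        raise ValueError("observed triples are impossible under both rules")
    return w_true / total


def sample(allowed, n, rng):
    """Draw n triples uniformly from a rule's allowed set."""
    pool = sorted(allowed)
    return [rng.choice(pool) for _ in range(n)]


rng = random.Random(0)
sycophantic = sample(NARROW, 5, rng)  # data drawn from the user's hypothesis
unbiased = sample(BROAD, 5, rng)      # data drawn from the true rule

print("sycophantic:", posterior_true_rule(sycophantic))
print("unbiased:   ", posterior_true_rule(unbiased))
```

In this toy setup, every sycophantic sample is consistent with both rules but is more probable under the narrow one, so belief in the true rule shrinks with each example. A single unbiased sample that falls outside the user's rule eliminates it.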


Claim: The absence of disconfirmation is epistemically more dangerous than hallucination in research contexts, because it produces false certainty rather than false content. [T2]

Batista & Griffiths make an adjacent argument — “unlike hallucinations that introduce falsehoods, sycophancy distorts reality by returning responses that are biased to reinforce existing beliefs” — but do not rank the failure modes by severity. The framework’s claim that FM-3 is more dangerous than FM-1 in research contexts specifically is a synthesis inference: false content can in principle be caught by verification, whereas false certainty suppresses the motivation to verify. This is derivable from the Batista & Griffiths mechanism but goes beyond what they state.

Sources: Batista & Griffiths (2026) for the mechanism. The severity ranking is T2 — derivable from the cited evidence but not stated in any source.


Claim: The sycophancy ≈ HARKing structural analog is valid. [T2]

HARKing (Hypothesizing After Results Are Known) is a well-documented QRP in which researchers construct hypotheses post-hoc to match obtained results. Batista & Griffiths provide formal Bayesian grounding for the sycophancy side: an agent receiving hypothesis-consistent data becomes increasingly confident in an incorrect hypothesis while making no progress toward truth — structurally identical to the epistemic outcome of HARKing. The mapping between the two is a synthesis inference across the QRP literature and the sycophancy empirics; it is not stated in either source but is directly derivable from both.

PARKing (Prompt Adjustments to Reach Known Outcomes), introduced by Kosch & Feger (2026), is the researcher-behavior analog: where sycophancy is the model-side mechanism, PARKing is the researcher-side behavior it enables. Both are relevant to FM-3 and are brought together here by framework synthesis.

Sources: Batista & Griffiths (2026) for the sycophancy mechanism; Kosch & Feger (2026) for PARKing — noting that Kosch & Feger is an opinion piece in CACM, not an empirical study. The HARKing analog mapping is T2.


FM-4: Synthesis Validity


Claim: LLMs combine findings across sources in ways that violate the conditions under which those findings are valid. [T2]

The mechanism is documented in the hallucination and synthesis literature: LLMs generate from parametric memory rather than from retrieved sources, producing claims that blend material from distinct findings without preserving the conditions under which each applies. This is T2: derivable from the hallucination literature but not stated in any single source as a synthesis validity claim in the sense defined here. A documented instance from this corpus (validation log Entry 3) illustrates the failure mode in practice: a figure conflating GPTZero’s finding about hallucinated citations in author submissions with Pangram Labs’ finding about AI-generated peer reviews, so that two distinct phenomena were merged into a single unsourced statistic. The instance is illustrative, not the primary grounding.

Sources: Hallucination literature generally; RAG and parametric memory research for the mechanism. Validation log Entry 3 as a documented instance. T2.

Claim: Synthesis validity failures are structurally related to the Synthesis Amplification Tendency documented in the validation log. [T3]

The validation log cross-cutting observations document a consistent pattern across multiple source validation entries: synthesis documents overstated the strength, universality, or causal certainty of findings. This is a practitioner observation from systematic source-checking within this project — domain observation carried out by the framework authors, not an independently published empirical finding. It carries T3 status and would require external replication to upgrade.

Sources: Validation log, Cross-Cutting Observations section. T3 — framework authors’ own systematic observation.


FM-5: Confidence Miscalibration


Claim: LLMs express confidence levels that do not accurately reflect the epistemic status of their claims, typically in the direction of overconfidence. [T1]

Fernandes et al. (2025) provide direct empirical grounding: AI assistance improved task performance while degrading metacognitive accuracy — participants became less able to assess what they knew and didn’t know after AI-assisted work. This is evidence of confidence miscalibration at the human-AI system level. The Batista & Griffiths finding that participants became increasingly confident in incorrect hypotheses while interacting with default LLMs is a second independent T1 grounding for the overconfidence direction.

Sources: Fernandes et al. (2025), Computers in Human Behavior, DOI: 10.1016/j.chb.2025.108779. Caveats: Study 1 quasi-experimental; AI literacy reversal finding has small effect sizes; model tested is ChatGPT-4o (gpt-4o-2024-05-13). Batista & Griffiths (2026) as secondary empirical grounding.


Claim: The T1/T2/T3/T4 tier system deployed in this framework is itself a grounding intervention for FM-5, and is a novel contribution of the framework. [T4 — provisional]

The tier taxonomy is novel in its application to AI research quality contexts. Adjacent prior art includes uncertainty quantification in ML (model-level, not output-level), confidence interval reporting in statistics (quantitative claims, not prose synthesis), and GRADE evidence quality frameworks in medicine (domain-specific, not generalizable to AI-assisted research). The tier system draws on these traditions but is not derivable from any of them. Requires human review before the framework claims this as a contribution.


FM-6: Contextual Override


Claim: LLMs revert to training-pattern responses when presented with cases that superficially resemble familiar patterns but differ in contextually significant ways (Mechanism A). [T1]

Soffer et al. (2025) provide primary empirical grounding. Presented with modified versions of well-known puzzles and ethics scenarios, LLMs reverted to training-data solutions at high error rates (lateral thinking: 58–92%; medical ethics: 76–96%), even in models with demonstrable reasoning capacity to recognize the modification. The mechanism is training-pattern dominance framed through dual-process theory.

Sources: Soffer, S., Sorin, V., Nadkarni, G.N., & Klang, E. (2025). npj Digital Medicine, 8, 461. DOI: 10.1038/s41746-025-01792-y. Attribution note: “reasoning-action disconnect” terminology is from a secondary blog source and does not appear in the paper.


Claim: A second independent mechanism — unfaithful chain-of-thought — produces contextual override through post-hoc rationalization rather than pattern reversion (Mechanism B). [T1]

Turpin et al. (2023) establish that CoT explanations systematically misrepresent the true reasons for model predictions under adversarial biasing. Arcuschin et al. (2026) extend this to non-adversarial naturally worded prompts, confirming implicit post-hoc rationalization at rates up to 13.49% across 15 frontier models. The two mechanisms are empirically documented independently and produce the same observable outcome — outputs that do not faithfully represent the model’s actual reasoning process — through different routes.

Sources: Turpin et al. (2023), NeurIPS 2023, arXiv:2305.04388. Model caveat: GPT-3.5 and Claude 1.0 only. Arcuschin et al. (2026), ICML 2026, arXiv:2503.08679v5. Models: 15 frontier models including current generation.


Claim: FM-6 Mechanism B may not be fully addressable at the prompt/GI level. [T2]

Arcuschin et al. show that answer biases are partially encoded in model representations before reasoning begins, and that thinking models are substantially more faithful than non-thinking models (gap of ~17pp for illogical shortcuts). Turpin et al. show that few-shot prompting reduces unfaithfulness relative to zero-shot. Together these findings support the inference that Mechanism B is partially addressable through inference-time interventions (thinking budget, prompting) but not eliminable — consistent with a partial architectural origin. This is a synthesis inference across both sources; neither states it directly.

Sources: Arcuschin et al. (2026) for the representation-level evidence and thinking model gap. Turpin et al. (2023) for the few-shot mitigation evidence. T2 — reasonable synthesis inference, not stated in either source.


FM-7: Omission


Claim: LLMs systematically fail to surface relevant evidence that contradicts, complicates, or bounds their claims, producing outputs that are technically accurate but incomplete in ways that mislead. [T2]

This failure mode has weaker primary empirical grounding than FM-1 through FM-6. The claim draws on two T1 sources: Batista & Griffiths (2026), where default LLM behavior systematically omits data conflicting with the user’s hypothesis, and the general sycophancy literature documenting selective omission of disconfirming evidence. FM-7 is broader than sycophancy-driven omission — it includes structural omissions from knowledge cutoff, retrieval scope, and domain coverage — making the full claim T2: derivable from the sycophancy empirics by extension but not directly measured.

Sources: Batista & Griffiths (2026) for sycophancy-driven omission. The broader structural omission claim is T2. This is the failure mode most in need of a dedicated primary source.


Claim: Omission is structurally distinct from fabrication and attribution drift because it operates through absence rather than presence, leaving no detectable artifact at the output level. [T2]

This distinction is derivable from the definitions of FM-1 and FM-2: both produce checkable artifacts (false claims, wrong attributions), while omission produces no artifact — detection requires independent knowledge of what should have been included. The distinction is not stated explicitly in the cited literature but follows directly from comparing the failure mode definitions. It is load-bearing for the GI-15 intervention design: Scope Enumeration works by converting absence into a presence that can be evaluated.

Sources: Inference across FM-1, FM-2, and FM-7 definitions. T2 — derivable from the cited structure, not stated in any single source.
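The idea of converting absence into presence can be sketched as a checklist comparison: once the expected scope is written down, every item receives an explicit status, and a missing item becomes a visible entry rather than a silent gap. This section does not specify how GI-15 is implemented, so the function name, the status labels, and the example scope items below are hypothetical.

```python
def scope_report(expected_scope, addressed):
    """Map every expected item to an explicit status so that gaps become visible."""
    addressed_set = set(addressed)
    return {
        item: "covered" if item in addressed_set else "MISSING"
        for item in expected_scope
    }


# Hypothetical scope for one output; the item names are illustrative only.
expected = [
    "supporting evidence",
    "contradicting evidence",
    "boundary conditions",
    "knowledge cutoff caveat",
]
addressed = ["supporting evidence", "boundary conditions"]

report = scope_report(expected, addressed)
missing = [item for item, status in report.items() if status == "MISSING"]
print(missing)
```

The check is only as good as the enumerated scope. It still depends on independent knowledge of what should have been included, which is the limitation the claim itself identifies.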


FM-8: Pragmatic Distortion


Claim: LLMs systematically reframe, soften, or reshape content to match inferred user preferences, producing outputs oriented toward agreement. [T1]

Sharma et al. (2024) provide foundational grounding: models conform to user preferences in judgment tasks, shifting answers when users indicate disagreement, across Claude 1.3, Claude 2, GPT-3.5, GPT-4, and LLaMA 2. The multi-model, multi-family replication strengthens the claim that this is a systematic property of RLHF-trained models. Shapira et al. (2026) provide complementary grounding using reward model experiments.

Sources: Sharma et al. (2024), ICLR 2024, arXiv:2310.13548. Model caveat: 2023-era models. Shapira, Benade & Procaccia (2026) — model caveat: open-source reward models only, not frontier deployed systems.

Shi, J., Zhang, T.J., Jin, Z., & Conitzer, V. (2026). From hallucination to scheming: A unified taxonomy and benchmark analysis for LLM deception. arXiv:2604.04788.


Claim: Pragmatic distortion is broader than sycophancy — it includes framing and emphasis shifts that leave factual content unchanged. [T2]

The sycophancy literature focuses on cases where models change stated positions on factual or evaluative questions. FM-8 extends this to cases where factual content is unchanged but pragmatic orientation shifts — what is foregrounded, what caveats are included, what implications are drawn. This extension is a synthesis inference: the sycophancy mechanisms described in Sharma et al. are consistent with producing pragmatic distortion without factual change, but this specific form is not the focus of the empirical studies.

Sources: Sharma et al. (2024) for the mechanism. The framing/emphasis extension is T2 — derivable from the mechanism but not directly measured.


FM-9: Structural Drift


Claim: In multi-turn AI-assisted research sessions, the interpretive frame of AI outputs expands progressively to incorporate concepts and implications not present in the original query, without researcher awareness. [T1]

Kim et al. (2026) provide primary empirical grounding across 105 multi-turn dialogues. Domain amplification was confirmed across four dimensions (Atmosphere d=0.46, Ipseity d=0.31, Intersubjectivity d=0.33, Temporality d=0.14). Domain expansion (the appearance of interpretive domains absent from user input) occurred in 83.8% of dialogues, with divergence beginning within the first 10% of normalized dialogue time. Negative controls confirm that the expansion is not generic conversational elaboration.

Sources: Kim, J.E., Holbrook, E., Hron, J.D., & Parsons, C.R. (2026). medRxiv, DOI: 10.64898/2026.03.19.26346371. Preprint caveat applies. Models: GPT-5.2, Gemini-2.5-Flash, Claude Sonnet 4.5.


Claim: Structural drift is distinct from sycophancy because it does not require agreement — drift operates through meaning-expansion regardless of the model’s stance. [T1/T2]

Kim et al. articulate this distinction explicitly, providing T1 grounding for the FM-8/FM-9 separation. The T2 component is the framework’s inference that this distinction requires a session-level rather than claim-level intervention — specifically GI-13 (Session Reset) as FM-9’s primary GI. That inference follows directly from the Kim et al. mechanism description but is not stated there.

Sources: Kim et al. (2026) for the mechanism distinction. The intervention inference is T2.


Claim: “Credentialing Drift” — the temporal erosion of signal value of any quality credential as the field learns to produce the credential without the underlying behavior — is a novel named phenomenon. [T4 — provisional]

The underlying dynamic is described by Goodhart’s Law and Campbell’s Law. The 2026-06-03 research pass confirmed that PRISMA compliance, pre-registration badges, and similar research quality credentials show this pattern. “Credentialing Drift” as a named phenomenon applied specifically to research quality infrastructure is the framework’s own contribution. The name and the application domain are not present in the Goodhart/Campbell literature. Requires human review before the framework claims this as a named contribution. The specific claim that this phenomenon has not previously been named in this form also requires a more systematic prior art search than has been performed.

Sources: Goodhart’s Law; Campbell’s Law. The named phenomenon and its application to AI research quality infrastructure is T4 — provisional.
