Rigorous AI-Assisted Research

When I started standing up AI-assisted research projects in earnest, a pattern appeared quickly. Outputs required substantial intervention before they could be trusted, not because the AI was obviously wrong, but because it was fluently, confidently, and plausibly wrong in ways that weren’t easy to catch. The interventions I made were often correct, but the process for arriving at them was improvised rather than systematic.

Two questions followed: what would a grounded approach look like, and could the failure modes be identified in advance? This framework is the answer to both. AI-assisted research fails in a small number of recurring structural patterns, each rooted in a definable contradiction between what AI systems are optimized to produce and what research quality requires. Naming those patterns, mapping them to a finite set of targeted interventions, and making the epistemic status of every claim explicit transforms AI-assisted research into an auditable practice.

I. The Problem

The cost of rejecting AI tools wholesale is no longer theoretical. AI is not merely a productivity tool for research: across more than 170 scientific fields, it measurably increases novelty and impact, redirects collective attention toward unexplored knowledge regions, and is increasingly necessary for preserving scientific knowledge at the scale modern research generates it (Bianchini et al., 2026; Malliaraki & Berditchevskaia; Rainford et al., 2026). The core problem with AI-assisted research is not that AI makes mistakes. Every analytical tool makes mistakes. The problem is that AI makes them fluently, at scale, in ways that look like finished work. The outputs have polished prose, properly formed citations, and well-structured conclusions. This fluency is the product of what these systems were trained to do.

This isn’t a reason to avoid AI tools. Powerful analytical tools can carry known failure conditions, and the appropriate response has never been to discard the tool. The fast Fourier transform (FFT) produces aliasing artifacts when sampling conditions aren’t respected. This is not a failure of the algorithm; it is a failure of instrumentation methodology. A researcher who understands aliasing designs their sampling accordingly. The failure mode is not in the transform; it’s in applying it without understanding what it requires. AI-assisted research fails in structurally analogous ways.

The practitioner response to AI quality failures has tended toward the ad hoc: prompt revision, manual fact-checking, skepticism as a disposition. Each is individually reasonable, but none is systematic or anticipatory. They are improvised in the face of specific outputs rather than designed against the class of failure that produced them. There is no shared vocabulary for what specifically goes wrong, and without that vocabulary there is no basis for designing against failure, only for reacting to it. Kosch and Feger (2026) draw the analogy explicitly: iterating prompts until an output satisfies is structurally similar to p-hacking. The behavior is rational at the individual level and a symptom of absent methodology.

The cost compounds. Errors anticipated before generation require minimal effort to prevent. Errors caught post-generation require effort to diagnose and correct — and may have already been incorporated into downstream synthesis. Errors not caught propagate forward, gaining apparent credibility each time they’re cited without challenge.

A structural answer looks like this: if the failure modes are predictable, finite, and have a definable mechanism, then they can be named. If they can be named, interventions can be designed. If interventions can be designed and applied at the right process stage, the research process becomes auditable rather than hopeful.

Figure 1. Two Approaches to AI Research Quality (Research Grounding Framework)

Ad Hoc Intervention (unstructured, reactive, late)
Pipeline: Query → AI Output → Use
Failure labels: fabrication, attribution drift, synthesis error, miscalibration
Intervention: late, improvised
Errors anticipated at no stage. Caught post-generation if caught at all. Errors not caught propagate forward, gaining apparent credibility each time they are cited without challenge.

Structured Grounding (anticipatory, targeted, auditable)
Pipeline: Query → Pre-Gen → At-Gen → Post-Gen → Use
Pre-Generation (anticipatory): GI-1, GI-2, GI-3, GI-14, and more
At-Generation (targeted): GI-4, GI-5, GI-6, GI-7, and more
Post-Generation (auditable): GI-9, GI-10, GI-11, GI-12, and more
Failures addressed at the stage where they originate. The process becomes auditable, not hopeful.

II. How This Framework Was Built

The framework began as a practical problem, not a theoretical one. Repeated experience with AI-assisted research projects produced a recognition: the interventions we were applying were often right, but we were arriving at them by intuition rather than by design. There was no principled way to anticipate what category of failure had occurred or what class of intervention addressed it.

The first question was taxonomic: are there a finite number of ways AI-assisted research fails, or is it an open-ended problem? The working hypothesis — borrowed from TRIZ’s core insight that inventive problems reduce to a small number of recurring contradiction types — was that failure modes are enumerable (Altshuller, 1996). If the contradictions driving failures are structural properties of how language models are built and trained, the failures themselves should be classifiable.

The structural borrowing from TRIZ is epistemological, not artifactual. The contradiction matrix and the forty inventive principles were not adopted. What transfers is the underlying logic: enumerate the contradictions, map them to a finite solution library, build the routing between them empirically.

Validation was systematic. Each failure mode and each intervention was traced to its source evidence, with the epistemic status of every claim made explicit — primary empirical where directly grounded in studies, synthesized or observational where it is not. Key citations appear with each failure mode in Section III. The source validation process — including several failures it caught in its own corpus — is published as a companion document.

The framework was developed using its own methods — source validation, epistemic tagging, and adversarial probing were applied to the framework’s own claims throughout development. This is the demonstration rather than a disclaimer: the authorship model is in [link to 003]; the grounding process in practice is in the Sources and Attribution appendix.

III. The Failure Mode Taxonomy

These are not bugs. They are structural properties of how language models are built and trained. The contradiction driving all nine failure modes is between fluency optimization (what RLHF rewards) and epistemic accuracy (what research requires). The failure modes are where that contradiction surfaces.

The nine modes are organized loosely from content failures (FM-1 through FM-4) to calibration and framing failures (FM-5 through FM-8) to session-level failures (FM-9).

Failure Mode Taxonomy (9 modes · 3 groups)

Code	Name	Description	Process Stage
Content Failures (FM-1 – FM-4)
FM-1	Fabrication (Generation)	Claim attributed to a source that does not exist or contains no material relevant to the claim	At generation · survives surface review
FM-2	Attribution Drift (Generation)	Source exists and is accessible, but the claim has drifted from what the source actually says	At generation · survives existence check
FM-3	Absence of Disconfirmation (Retrieval)	Confirming sources retrieved; evidence that contradicts, qualifies, or limits the claim not searched	At retrieval · procedural failure; claim may be true
FM-4	Synthesis Validity (Generation)	AI-generated synthesis presented as a documented finding; not present in any single source	At generation · survives source existence checks
Calibration and Framing Failures (FM-5 – FM-8)
FM-5	Confidence Miscalibration (Generation)	Epistemic status of a claim not tagged, incorrectly tagged, or misrepresented in argument weight	At generation · invisible in fluent prose
FM-6	Contextual Override (Generation)	Model defaults to familiar training pattern; responds to what the prompt resembles rather than what it says	Single-turn · reasoning trace unreliable if Mech. B active
FM-7	Omission (Retrieval)	Relevant information exists in the evidence base but is not surfaced in output; failure by absence	At retrieval · no artifact; hardest to detect
FM-8	Pragmatic Distortion (Generation)	Technically accurate but framing, emphasis, or selection creates a misleading impression	At generation · resistant to source-based checking
Session-Level Failures (FM-9)
FM-9	Structural Drift (Session)	Over multi-turn sessions, AI outputs increasingly reflect and amplify the framing of the initial exchanges	Cumulative across session · Session Reset is diagnostic

FM-1: Fabrication

A claim is made and attributed to a source that does not exist, cannot be located, or — in a stricter form — exists but contains no material relevant to the claim. The AI has generated a plausible citation from parametric memory rather than from actual retrieval.

Fabrication is structurally dangerous because it is self-concealing: a fabricated source looks like a real source in prose. It passes every surface coherence check ordinary reading involves. Detection requires independent verification of source existence. Fabrication rates vary substantially by model and domain; a GPTZero analysis of early ICLR 2026 submissions found hallucinated citations in roughly 17% of sampled papers (GPTZero, 2026), though this figure should be understood as a time-stamped estimate against specific model versions, not a stable baseline.

FM-2: Attribution Drift

A source exists and is accessible, but the claim attached to it has drifted from what the source actually says. The drift may be subtle — a finding generalized beyond its scope, a caveat quietly dropped, a correlation silently upgraded to causation — or it may be structural, as when synthesis documents consistently amplify findings beyond what primary sources claim.

Unlike fabrication, the citation is real and will survive an existence check. Detection requires reading the source. A documented instance from this corpus: in early synthesis passes, a 2024 study on preregistration in psychology was consistently attributed to “Bogaert et al.” No author named Bogaert appears anywhere in the paper. The actual lead author is van den Akker. The name drifted silently through multiple iterations of the synthesis without triggering any surface check.

FM-3: Absence of Disconfirmation

The evidence base supporting a claim has been populated with confirming sources; evidence that contradicts, qualifies, or limits the claim has not been retrieved. The failure is not that no contradicting evidence exists — it may not. The failure is that no search for contradicting evidence was conducted.

AI-assisted search exhibits a documented tendency toward sycophantic retrieval: surfacing material that confirms the framing of the query. Batista and Griffiths (2026) formally establish the Bayesian mechanism: RLHF training rewards confirmation, which systematically biases retrieval toward hypothesis-supporting material. A claim confirmed by AI-assisted search carries no more evidential weight than a claim confirmed by a researcher who never considered alternatives.

FM-4: Synthesis Validity

The AI combines material from multiple sources, or generates from parametric memory, and produces a claim not present in any single source. The synthesis may be coherent, plausible, and even correct, but it is a constructed position, not a documented one. It is presented, implicitly or explicitly, as if it were a finding from the literature.

A documented instance: early synthesis work on ICLR 2026 produced a composite statistic claiming ~20% of ICLR submissions were problematically AI-generated. The figure conflated two entirely distinct findings: GPTZero’s analysis of ~17% of author submissions containing at least one hallucinated citation (GPTZero, 2026), and Pangram Labs’ analysis of ~21% of peer reviews being fully AI-generated (Pangram Labs, 2026). Two different phenomena — author behavior and reviewer behavior — merged into a single number with no source that supports it.

FM-5: Confidence Miscalibration

The epistemic status of a claim is not tagged, is incorrectly tagged, or is systematically misrepresented in the weight given to it in an argument. A practitioner observation is treated with the weight of a randomized controlled trial (RCT). An AI-generated synthesis is presented as a primary empirical finding. A single-study result is generalized as if it were meta-analytic consensus.

Confidence miscalibration is often invisible in well-written prose. The language of certainty is consistent regardless of whether the underlying evidence warrants it. The same fluency that produces miscalibrated claims also undermines the researcher’s ability to detect them: Fernandes et al. (2025) document that AI use produces systematic overestimation of one’s own output quality, with the effect largest among AI-literate users. The mechanism is self-reinforcing — the AI’s fluency suppresses the epistemic vigilance that would otherwise flag uncertain claims.

FM-6: Contextual Override

The AI defaults to a familiar, training-data-dominant response pattern when the actual task requires processing context that differs materially from that pattern. The model responds to what the prompt resembles rather than what it says. Two mechanisms produce this failure:

Mechanism A (pattern-match override): The model recognizes a prompt as resembling a familiar problem type and applies the trained solution for that type, overriding context-specific details. Soffer et al. (2025) documented this in medical ethics scenarios: models defaulted to canonical ethical framework responses regardless of contextual nuance deliberately introduced into the scenarios, with error rates of 76–96%. The mechanism is consistent with dual-process theory: fast, pattern-based processing overriding slow, deliberative processing.

Mechanism B (unfaithful chain-of-thought): The model’s reasoning trace arrives at a correct or qualified conclusion, but the generated output contradicts or ignores it. The verbalized reasoning is closer to post-hoc rationalization than an accurate window into the computation that produced the output. Turpin et al. (2023, NeurIPS) established the phenomenon under biasing conditions; Arcuschin et al. (2026, ICML) extended it to naturally worded, non-adversarial prompts across 15 frontier models, finding unfaithfulness rates ranging from 0.04% to 13.49%, with the lowest rates corresponding to advanced, thinking-enabled models and the highest to non-thinking counterparts. Thinking models are substantially more faithful than non-thinking counterparts, with a gap of roughly 17 percentage points consistent across Anthropic, DeepSeek, and Qwen model pairs, but the gap is a reduction, not an elimination. Where Mechanism B is present, the reliability of the reasoning trace as a verification tool degrades in proportion; it should be treated as evidence, not confirmation.

FM-7: Omission

Relevant information exists in the available evidence base but is not surfaced in the output. The output is not wrong about what it says; it is incomplete in what it includes. The failure is in coverage, not in accuracy.

Omission is the hardest failure mode to detect because it leaves no artifact. There is no false citation to check, no drifted attribution to compare, no invalid synthesis to trace. The gap is invisible unless you already know what should be there. Shi et al. (2026) found that only 18% of 50 evaluated benchmarks test for omission, compared to 100% that test fabrication. Omission evades detection and measurement for the same reason.

FM-8: Pragmatic Distortion

A claim is technically accurate — the source exists, the attribution is correct, the epistemic status is appropriately assigned — but the framing, emphasis, or presentation creates a misleading impression. Common forms: asymmetric emphasis (benefits foregrounded, costs buried), false balance (minority and majority positions presented as equivalent), decontextualization (a finding presented without limiting conditions), and sycophantic framing (output shaped toward what the researcher appears to want to hear).

Pragmatic distortion is the failure mode most resistant to source-based checking (Shi et al., 2026). The sources are real, accurately cited, and faithfully represented. The distortion occurs at the level of selection, emphasis, and arrangement — and AI systems trained to produce satisfying outputs are structurally inclined toward it.

FM-9: Structural Drift

Over the course of a multi-turn research session, the AI’s outputs increasingly reflect and amplify the framing established in the initial exchanges. Later outputs are not generated from a neutral starting position — they are generated from a context window progressively shaped by earlier turns. The frame tightens; confirming material is more readily produced; qualifying material recedes.

Two mechanisms operate at the session level. Domain amplification intensifies existing concerns across exchanges. Domain expansion introduces entirely new interpretive frames that were absent from the researcher’s input — the AI actively scaffolds new dimensions rather than just amplifying existing ones. Kim et al. (2026) documented both in clinical AI dialogues, finding domain expansion in 83.8% of sessions, with the divergence beginning within the first 10% of normalized dialogue time.

Structural drift does not require the AI to agree with the researcher. The distortion is in how meaning is scaffolded across time, not in stance. By the time it is visible, it has typically already shaped outputs that will be carried forward.

FM	Documented Instance
FM-1 Fabrication	latentscholar.org ~40% protocol deviation figure — confirmed AI-generated synthetic content; no underlying dataset exists
FM-2 Attribution Drift	“Bogaert et al.” (van den Akker et al. 2024): author name drifted silently across multiple synthesis iterations; no Bogaert in the paper
FM-3 Absence of Disconfirmation	Synthesis Amplification Tendency — across 12 corpus entries, confirming material consistently foregrounded; disconfirmation search not conducted by default
FM-4 Synthesis Validity	ICLR ~20% composite figure — GPTZero author-citation finding and Pangram Labs reviewer-behavior finding merged into single unsourced statistic
FM-5 Confidence Miscalibration	Sharma et al. ICLR 2024 sycophancy findings generalized from 2023-era models to all frontier systems without caveat
FM-6 Contextual Override	Entry 6: “Mount Sinai reasoning-action disconnect” terminology — introduced by commercial blog reframing Soffer et al. (2025); original paper contains no such framing
FM-7 Omission	Domain expansion mechanism in FM-9: absent from initial FM-9 definition; Kim et al. (2026) finding not surfaced in prior synthesis pass
FM-8 Pragmatic Distortion	Fernandes et al. (2025) AI literacy reversal: effect sizes and quasi-experimental caveat omitted; finding presented as stronger than source supports
FM-9 Structural Drift	Kim et al. (2026) clinical dialogue study: domain expansion in 83.8% of sessions; divergence begins within first 10% of dialogue time

IV. The Grounding Interventions

Grounding interventions (GIs) are not a checklist. They are a structured response library. Not all fifteen are needed for every research task; the matrix in the next section determines which apply to which failure modes. The task is to recognize the failure mode, locate it in the matrix, and apply the appropriate intervention at the appropriate stage.

Organization by process stage is intentional. Pre-generation interventions change what gets produced. At-generation interventions change how it gets produced. Post-generation interventions catch what slipped through. Session-level intervention resets the conditions under which all others operate.

Grounding Interventions Reference (15 interventions · 4 stages)

Code	Name	Mode	Description	Primary FM(s)
Pre-Generation (GI-1, GI-2, GI-3, GI-14): Constrain what gets produced before generation begins
GI-1	Source Anchoring	Preventative	Requires AI to attach a verifiable, accessible source to any factual claim at the point of output	FM-1
GI-2	Temporal Scoping	Preventative	Restricts claims to a defined knowledge window anchored to a stated review date or training cutoff	FM-5
GI-3	Epistemic Status Tagging	Preventative	Assigns each claim to a confidence tier (T1–T4) before use as the basis for argument or synthesis	FM-5
GI-14	Context-Delta Marking	Preventative	Requires AI to articulate how the current case differs from the canonical version before generating (novel contribution)	FM-6
At-Generation (GI-4 through GI-8, GI-15): Shape how the AI reasons and retrieves while producing output
GI-4	Retrieval Augmentation	Preventative	Forces re-query against a verified, bounded corpus rather than relying on model weights alone	FM-1
GI-5	Mandatory Disconfirmation Search	Preventative	Explicitly requires a search for evidence contradicting the current claim before that claim is accepted	FM-3
GI-6	Contradiction Forcing	Preventative	TRIZ-derived: any claim presenting a benefit must articulate what degrades when that benefit is realized	FM-3, FM-8
GI-7	9-Windows	Preventative	TRIZ-derived: examines a claim from nine perspectives: system, subsystems, supersystem × past, present, future	FM-8, FM-9
GI-8	Ontology Grounding	Preventative	Anchors named entities to a structured, verifiable knowledge base before they are used in claims	FM-1
GI-15	Scope Enumeration	Preventative	Requires AI to explicitly state what it did not search for alongside what it did (novel contribution)	FM-7
Post-Generation (GI-9 through GI-12): Detect, diagnose, and correct failures in outputs already produced
GI-9	Provenance Tracing	Corrective	Traces an existing claim backward to determine its origin; claims identified as T4 carry a forward-prevention property	FM-4
GI-10	Decompose and Verify	Corrective	Breaks a complex or synthesized claim into constituent sub-claims and subjects each to independent checking	FM-4, FM-2
GI-11	Adversarial Probing	Corrective	Routes an output to a verification process explicitly tasked with finding fault in the output as a whole	FM-8, FM-6
GI-12	Adversarial Reframing	Corrective	Instructs AI to construct the strongest case against its previous output; corrective counterpart to GI-6	FM-8, FM-3
Session-Level (GI-13): Resets the conditions under which all other GIs operate
GI-13	Session Reset	Structural	Mechanically clears the AI context window; resets accumulated framing rather than addressing a specific claim	FM-9
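
The Primary FM(s) column of the reference table can be read as a lookup from a recognized failure mode to candidate interventions. The sketch below transcribes that column into a Python dictionary and inverts it. It covers only the primary targets listed in the table, not the full routing matrix, and the names `PRIMARY_FMS` and `interventions_for` are illustrative assumptions rather than part of the framework.

```python
# Primary failure mode(s) per intervention, transcribed from the reference
# table. Assumption: this is only the "Primary FM(s)" column, not the full
# routing matrix described in the next section.
PRIMARY_FMS = {
    "GI-1": ["FM-1"],
    "GI-2": ["FM-5"],
    "GI-3": ["FM-5"],
    "GI-14": ["FM-6"],
    "GI-4": ["FM-1"],
    "GI-5": ["FM-3"],
    "GI-6": ["FM-3", "FM-8"],
    "GI-7": ["FM-8", "FM-9"],
    "GI-8": ["FM-1"],
    "GI-15": ["FM-7"],
    "GI-9": ["FM-4"],
    "GI-10": ["FM-4", "FM-2"],
    "GI-11": ["FM-8", "FM-6"],
    "GI-12": ["FM-8", "FM-3"],
    "GI-13": ["FM-9"],
}


def interventions_for(failure_mode: str) -> list[str]:
    """Return interventions whose primary targets include failure_mode (table order)."""
    return [gi for gi, fms in PRIMARY_FMS.items() if failure_mode in fms]


print(interventions_for("FM-8"))  # ['GI-6', 'GI-7', 'GI-11', 'GI-12']
```

Keeping the mapping keyed by intervention mirrors the table; the inversion is computed on demand so the two views cannot drift apart.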

Pre-Generation: GI-1 through GI-3, GI-14

GI-1: Source Anchoring — an admissibility constraint. The AI is required to attach a verifiable, accessible source to any factual claim at the point of output. Unsourced claims are inadmissible. Source anchoring does not verify that the source supports the claim (that is GI-9’s function). It only ensures a source is named, preventing fabrication by removing the option of unattributed generation.

GI-2: Temporal Scoping — a knowledge window constraint. Claims are restricted to a defined review period, anchored to the model’s training cutoff or a stated review date. Prevents stale training-era claims from presenting as current.

GI-3: Epistemic Status Tagging — assigns each claim to one of four confidence tiers before it is used as a basis for argument or synthesis:

Tier	Type	Description
T1	Primary empirical	Original study with direct measurement. The claim is made in the source and supported by data collected for that purpose.
T2	Secondary synthesis	Claim derived by synthesis across multiple T1 sources, or from a systematic review or meta-analysis. Supportable from cited evidence through reasonable extrapolation; the synthesis move is explicit and traceable.
T3	Practitioner/domain observation	Expert opinion, case study, or practitioner framework. Carries evidential weight within its domain but lacks the measurement apparatus of T1.
T4	Novel framing	Conceptual content not derivable from cited sources even through reasonable extrapolation. Introduces named phenomena, structural distinctions, or analytical frameworks that are the framework’s own contribution. Requires explicit human review before use as a basis for argument.

Tagging does not assess quality within tiers. Its function is to prevent T4 claims from being treated as T1, and to make the evidential foundation of any argument visible and auditable. Unreviewed T4 claims carry a [T4 — provisional] notation; T4 claims that have been explicitly reviewed carry [T4 — reviewed]. A synthesis document that mixes T1 and T4 without labeling which is which is epistemically unreliable regardless of how confident it reads.
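
One way to make the tiers machine-checkable is to attach a tier to every claim and refuse to build on unreviewed T4 material. The following is a minimal sketch under stated assumptions: the class and method names are hypothetical, and the bare `[T1]`-style label for non-T4 tiers is an illustrative choice, since the framework only specifies the T4 notations.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    T1 = "Primary empirical"
    T2 = "Secondary synthesis"
    T3 = "Practitioner/domain observation"
    T4 = "Novel framing"


@dataclass
class TaggedClaim:
    text: str
    tier: Tier
    reviewed: bool = False  # only meaningful for T4 claims

    def label(self) -> str:
        """Render the tag; T4 claims carry their review status."""
        if self.tier is Tier.T4:
            status = "reviewed" if self.reviewed else "provisional"
            # \u2014 is the dash used in the framework's T4 notation.
            return f"[T4 \u2014 {status}]"
        return f"[{self.tier.name}]"  # assumed format for T1-T3

    def usable_as_basis(self) -> bool:
        """T4 claims need explicit human review before supporting an argument."""
        return self.tier is not Tier.T4 or self.reviewed


claim = TaggedClaim("Failure modes are enumerable.", Tier.T4)
print(claim.usable_as_basis())  # False until a human marks it reviewed
```

The sketch deliberately does not score quality within a tier; it only keeps the tier visible and blocks the T4-as-T1 substitution.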

GI-14: Context-Delta Marking — a novel contribution of this framework. Before generating a response, the AI is required to explicitly state how the current case differs from the canonical or most familiar version of the problem. The mechanism targets training-pattern dominance directly: FM-6 Mechanism A occurs because the model processes surface similarity and retrieves a familiar response without registering contextual differences (Soffer et al., 2025). Requiring explicit articulation of the delta is meant to force the model to process novel features before generating. If the model cannot identify a meaningful delta, that itself is diagnostic.

At-Generation: GI-4 through GI-8, GI-15

GI-4: Retrieval Augmentation — forces re-query against a verified, bounded corpus for specific claims rather than relying on model weights. An unrestricted web search is not retrieval augmentation in this sense — the corpus must be explicitly defined and constrained.

GI-5: Mandatory Disconfirmation Search — explicitly requires a search for evidence contradicting the current claim before that claim is accepted as supported. It is not sufficient to note that no contradicting evidence was found; the search must be conducted and documented. This is the direct counter to sycophantic retrieval (Batista & Griffiths, 2026).
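
The requirement that the search be conducted and documented, not merely asserted, can be expressed as a gate on claim status. This is a sketch only; the record fields and status strings are assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DisconfirmationSearch:
    queries: list[str]           # terms aimed at contradicting the claim
    sources_searched: list[str]  # where the search was actually run
    contradicting_findings: list[str] = field(default_factory=list)

    def is_documented(self) -> bool:
        """A search counts only if both its queries and its sources are recorded."""
        return bool(self.queries) and bool(self.sources_searched)


@dataclass
class Claim:
    text: str
    disconfirmation: Optional[DisconfirmationSearch] = None

    def status(self) -> str:
        search = self.disconfirmation
        if search is None or not search.is_documented():
            return "not accepted: no documented disconfirmation search"
        if search.contradicting_findings:
            return "contested: contradicting evidence found"
        return "supported: search documented, nothing contradicting found"


claim = Claim("Pre-registration reduces p-hacking.")
print(claim.status())  # not accepted: no documented disconfirmation search
```

An empty result list is acceptable here; an empty search record is not, which is the distinction the intervention turns on.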

GI-6: Contradiction Forcing, a TRIZ-derived reasoning intervention. Any claim presenting a benefit, improvement, or positive finding must be accompanied by an explicit articulation of the corresponding tradeoff: what degrades when this improves? One of TRIZ’s core insights is that hard problems always involve a contradiction (Altshuller, 1996). Applied to grounding, this makes it structurally impossible for the AI to return a purely confirmatory response.

GI-7: 9-Windows — a TRIZ-derived frame expansion intervention (Altshuller, 1996). Before accepting a claim, require examination from nine perspectives: the system itself, its subsystems, and its supersystem, each across past, present, and future. A claim that looks robust at the system level may be undermined at the subsystem level or supersystem level, or may not survive a change in time horizon. In practice: before accepting a claim about a research methodology, 9-Windows requires examination of its components (has each element of the process always worked as intended?), the broader context it sits within (what happens to its signal value as adoption scales?), and its trajectory over time (when did it work differently, and why?). The nine perspectives surface questions the original query does not ask — which is the point.

The following table applies the intervention in full to a claim already present in this framework’s corpus. The claim is familiar; the value of the exercise is in what the supersystem × future cell surfaces.

Claim under examination: Pre-registration reduces p-hacking.

9-Windows: Frame Expansion in Practice (GI-7; 3 perspectives · 3 time horizons · 9 cells)

Level	Past	Present	Future
Supersystem	The replication crisis emerged from an ecosystem that structurally incentivized p-hacking: journals rewarded novelty and positive results; no credential existed to distinguish pre-specified from post-hoc analysis. Pre-registration was a corrective response to conditions already producing failures.	Pre-registration is one quality signal among many in a peer review ecosystem that still privileges statistical significance and novel findings. Its epistemic weight depends partly on whether surrounding incentive structures reinforce or undermine it.	[Key cell] As adoption becomes universal, the credential stops discriminating between rigorous and compliant-but-non-rigorous work. A field that requires pre-registration of all submissions cannot use pre-registration to identify which submissions are trustworthy.
System	Pre-registration originated in clinical trial management (ClinicalTrials.gov, 2000) as a response to documented outcome-switching. Its transfer to behavioral and social science was deliberate: a methodological import after high-profile replication failures in psychology (2010s).	Timestamped public registration of hypotheses, methods, and analysis plans before data collection. Provides temporal evidence that hypotheses preceded results. Reduces HARKing when implemented with specificity.	AI tools can produce plausible, internally consistent registration documents from outcomes already known. The mechanism (temporal precedence) survives; the epistemic guarantee that the mechanism was followed does not.
Subsystem	Early registration templates were minimal. Hypothesis statement, analysis plan, exclusion criteria, and sample size justification were developed iteratively. Initial versions lacked the specificity that makes deviation detectable.	Individual components carry unequal epistemic weight. Timestamping is robust and difficult to falsify. Analysis plan specificity varies widely. Deviation auditing between registration and publication is inconsistent and rarely enforced.	Automated timestamping scales; systematic audit of deviation between registration and final report does not, without dedicated tooling. Components degrade asymmetrically as volume increases.

The system × present cell confirms the claim as stated. Eight of the nine cells support it, qualify it, or reveal its conditions of validity. The supersystem × future cell is where the claim’s scope breaks down: universal adoption converts the credential from a signal into a baseline requirement, severing the relationship between the credential and the quality it was designed to indicate. That is the cell the original query — does pre-registration reduce p-hacking? — cannot reach. It is also the cell that matters most for a researcher deciding how much epistemic weight to place on a pre-registered study published in 2027.
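
The nine cells are the cross product of three system levels and three time horizons, which makes it straightforward to track which cells have been examined. The sketch below assumes a simple dictionary representation; the function names are hypothetical, and the two notes are short paraphrases of cells from the worked example.

```python
from itertools import product
from typing import Optional

LEVELS = ("supersystem", "system", "subsystem")
HORIZONS = ("past", "present", "future")

Grid = dict[tuple[str, str], Optional[str]]


def empty_grid() -> Grid:
    """One entry per (level, horizon) cell, initially unexamined."""
    return {cell: None for cell in product(LEVELS, HORIZONS)}


def unexamined(grid: Grid) -> list[tuple[str, str]]:
    """Cells that still have no recorded note."""
    return [cell for cell, note in grid.items() if not note]


grid = empty_grid()
grid[("system", "present")] = (
    "Timestamped registration gives temporal evidence that hypotheses preceded results."
)
grid[("supersystem", "future")] = (
    "Under universal adoption the credential stops discriminating between submissions."
)

print(len(unexamined(grid)))  # 7 cells still to examine
```

Treating a claim as examined only when `unexamined` returns an empty list is one possible acceptance rule; the framework itself does not prescribe a threshold.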

GI-8: Ontology Grounding — requires named entities (people, studies, institutions, concepts) to be anchored to a structured, verifiable knowledge base before use. Catches the specific class of fabrication where entities are plausible but not real — a different target than GI-1 (which operates on sources attached to claims rather than entities within them).

GI-15: Scope Enumeration — a novel contribution of this framework. Before or alongside output, the AI states the boundaries of its search: what corpora were queried, what query terms were used, what categories of evidence were excluded or not attempted, what the output does not cover. GI-15 does not require the AI to know what it missed — it requires it to state what it attempted. The researcher, with domain knowledge, identifies the gap. Omissions that were invisible become visible as absences in the stated scope. The mechanism transfers systematic review discipline (PRISMA’s required reporting of search terms, databases, and exclusion criteria) to real-time AI-assisted research.
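A scope statement of this kind can be captured as a small record that travels with the output. The sketch below is illustrative only: the class name, field names, and example values are assumptions chosen to mirror the four items listed above, not part of the framework's specification.

```python
from dataclasses import dataclass, field


@dataclass
class ScopeStatement:
    """What the AI attempted, stated alongside its output (GI-15)."""
    corpora_queried: list[str] = field(default_factory=list)
    query_terms: list[str] = field(default_factory=list)
    excluded_or_not_attempted: list[str] = field(default_factory=list)
    not_covered: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Names of scope fields left empty, in declaration order."""
        return [name for name, value in vars(self).items() if not value]


# Illustrative values only; a real statement would come from the session itself.
scope = ScopeStatement(
    corpora_queried=["PubMed"],
    query_terms=["pre-registration", "p-hacking"],
    excluded_or_not_attempted=[],
    not_covered=["non-English studies"],
)
print(scope.missing_fields())  # ['excluded_or_not_attempted']
```

An empty field does not mean nothing was excluded. It means the exclusion was not stated, which is exactly the absence the researcher is meant to notice.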

Post-Generation: GI-9 through GI-12

GI-9: Provenance Tracing — takes an existing claim and traces it backward to determine its origin: sourced to a named document, synthesized across sources, or generated from model weights without a source basis. Answers where did this come from? rather than is this true? Claims identified as T4 through provenance tracing carry a forward-prevention property: they should not be cited forward as sourced findings.

GI-10: Decompose and Verify — breaks a complex or synthesized claim into its constituent sub-claims and subjects each to independent checking. A complex claim may be partially true, with false or unsupported components hidden inside a plausible overall structure.
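A minimal sketch of the roll-up step, assuming the researcher has already split the claim and checked each part independently. The verdict categories and the `summarize` rule are hypothetical choices for illustration; the framework does not prescribe them.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    SUPPORTED = "supported"
    UNSUPPORTED = "unsupported"
    FALSE = "false"


@dataclass
class SubClaim:
    text: str
    verdict: Verdict  # assigned by independent checking, not by this code


def summarize(sub_claims: list[SubClaim]) -> str:
    """Roll sub-claim verdicts up without letting strong parts hide weak ones."""
    if not sub_claims:
        return "not decomposed"
    verdicts = {sc.verdict for sc in sub_claims}
    if verdicts == {Verdict.SUPPORTED}:
        return "supported"
    if Verdict.SUPPORTED in verdicts:
        return "partially supported"
    return "not supported"


parts = [
    SubClaim("sub-claim A", Verdict.SUPPORTED),
    SubClaim("sub-claim B", Verdict.UNSUPPORTED),
]
print(summarize(parts))  # partially supported
```

The point of the design is that a single unsupported component downgrades the whole claim, so a plausible overall structure cannot mask it.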

GI-11: Adversarial Probing — routes an output to a verification process explicitly tasked with finding fault. This may be a second AI query with an adversarial prompt, a second researcher reviewing for errors, or a structured checklist applied from a skeptical position. Its primary function is catching failures invisible to the original generation process — including confidence miscalibration that slipped through pre-generation prevention.

GI-12: Adversarial Reframing — instructs the AI to construct the strongest case against its previous output. Where Contradiction Forcing (GI-6) prevents one-sided outputs during generation, Adversarial Reframing corrects them after the fact. The output is not a replacement for the original claim but a counterweight.

Session-Level: GI-13

GI-13: Session Reset — mechanically clears the AI’s active context window and begins a new session, eliminating accumulated conversational history. Session Reset is categorically different from all other GIs: it does not operate on a specific claim, reasoning step, or output. It resets the conditions under which all subsequent GIs operate.

The operational cost is real: cleared context must be partially restored through a researcher-constructed grounding summary. That summary is itself a grounding act — it requires conscious selection of what context to reintroduce, which surfaces implicit framing assumptions that accumulated invisibly. A poorly constructed re-grounding summary can reintroduce the drift it was intended to clear. Recommended use is as a periodic structural discipline at natural break points, not only when drift is already visible (Kim et al., 2026). By the time drift is visible, it has typically already shaped outputs that will be carried forward.

V. The Matrix

The FM × GI matrix maps each failure mode to the interventions that provide primary coverage (explicitly designated in the GI definition) and secondary coverage (also designated, but not the primary target). A third category — structural or indirect coverage, marked ○ — captures relationships that are reasonably inferred but not documented in GI definitions.

GIs are organized by process stage. This matters: primary coverage at the pre-generation stage prevents a failure from occurring; primary coverage at the post-generation stage detects and corrects it after the fact. A failure mode with only post-generation coverage is harder to manage than one with prevention built in.
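The stage argument can be checked mechanically against the primary designations. The sketch below transcribes the primary GIs per failure mode from the coverage tables in this section and the stage groupings from the GI library; it considers primary coverage only, so secondary and structural coverage are deliberately ignored. The function name is an illustrative assumption.

```python
# Primary GIs per failure mode, transcribed from this section's coverage tables.
PRIMARY_GIS = {
    "FM-1": ["GI-1", "GI-4", "GI-8"],
    "FM-2": ["GI-10"],
    "FM-3": ["GI-5", "GI-6", "GI-12"],
    "FM-4": ["GI-9", "GI-10"],
    "FM-5": ["GI-2", "GI-3"],
    "FM-6": ["GI-14", "GI-11"],
    "FM-7": ["GI-15"],
    "FM-8": ["GI-6", "GI-7", "GI-11", "GI-12"],
    "FM-9": ["GI-7", "GI-13"],
}

# Process stage of each GI, as grouped in the framework.
STAGE_OF = {
    f"GI-{n}": stage
    for stage, numbers in {
        "pre-generation": (1, 2, 3, 14),
        "at-generation": (4, 5, 6, 7, 8, 15),
        "post-generation": (9, 10, 11, 12),
        "session-level": (13,),
    }.items()
    for n in numbers
}


def detection_only(matrix: dict[str, list[str]], stage_of: dict[str, str]) -> list[str]:
    """Failure modes whose primary GIs all sit at the post-generation stage."""
    return sorted(
        fm for fm, gis in matrix.items()
        if all(stage_of[gi] == "post-generation" for gi in gis)
    )


print(detection_only(PRIMARY_GIS, STAGE_OF))  # ['FM-2', 'FM-4']
```

On primary designations alone, FM-2 and FM-4 are the two failure modes that are detected and corrected rather than prevented.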

Interactive: The full FM × GI coverage matrix — with entry points by failure mode, grounding intervention, and cell-level coverage detail — is available as a standalone tool at Framework → Research Grounding Matrix.

The coverage findings are worth naming explicitly:

FM-1, FM-3, and FM-4 have the broadest coverage — six GIs each across multiple process stages. FM-8 (Pragmatic Distortion) has the most primary GIs — four — reflecting its multiple mechanism pathways and its structural resistance to source-based checking alone. FM-5, FM-6, and FM-7 have the most concentrated coverage — three GIs each.

GI-7 (9-Windows) and GI-11 (Adversarial Probing) reach the most failure modes — five FMs each — making them the highest-leverage single interventions in the library.

Research Grounding Framework: Coverage Analysis (FM × GI Matrix, v0.2)

Coverage per Failure Mode

Failure mode | Name | Primary GIs
FM-1 | Fabrication | GI-1, GI-4, GI-8
FM-2 | Attribution Drift | GI-10
FM-3 | Absence of Disconfirmation | GI-5, GI-6, GI-12
FM-4 | Synthesis Validity | GI-9, GI-10
FM-5 | Confidence Miscalibration | GI-2, GI-3
FM-6 | Contextual Override | GI-14, GI-11
FM-7 | Omission | GI-15
FM-8 | Pragmatic Distortion | GI-6, GI-7, GI-11, GI-12
FM-9 | Structural Drift | GI-7, GI-13

Coverage per Intervention

Intervention | Name | Primary FMs
GI-1 | Source Anchoring | FM-1
GI-2 | Temporal Scoping | FM-5
GI-3 | Epistemic Status Tagging | FM-5
GI-14 | Context-Delta Marking | FM-6
GI-4 | Retrieval Augmentation | FM-1
GI-5 | Mandatory Disconfirmation | FM-3
GI-6 | Contradiction Forcing | FM-3, FM-8
GI-7 | 9-Windows | FM-8, FM-9
GI-8 | Ontology Grounding | FM-1
GI-15 | Scope Enumeration | FM-7
GI-9 | Provenance Tracing | FM-4
GI-10 | Decompose and Verify | FM-4, FM-2
GI-11 | Adversarial Probing | FM-8, FM-6
GI-12 | Adversarial Reframing | FM-8, FM-3
GI-13 | Session Reset | FM-9 (structural)

Two design gaps survive in the current version and should be named rather than papered over.

FM-6 Mechanism B (unfaithful chain-of-thought) has no pre-generation or at-generation intervention. GI-14 targets Mechanism A (pattern-match override); Mechanism B operates at a different level. Arcuschin et al. (2026) found that biases are partially encoded in model representations before reasoning begins — consistent with a partial architectural origin. Thinking models are substantially more faithful than non-thinking counterparts (roughly 17 percentage points across model pairs), suggesting partial addressability through inference-time compute, but not elimination. Whether a prompt-level pre-generation intervention is possible for Mechanism B is an open question; the gap may require architectural or training-level solutions outside this framework’s scope.

Pre-generation coverage for FM-3, FM-4, FM-7, FM-8, and FM-9 is absent beyond the general constraints of GI-1 through GI-3. A pre-generation framing constraint, requiring explicit enumeration of what the query is not asking (analogous to what GI-1 does for sources, but applied to the frame of the query itself), remains a candidate GI for a future version.

VI. Epistemic Status and the Framework’s Own Claims

The epistemic status tagging intervention (GI-3) is applied to the framework’s own claims in the Sources and Attribution appendix. Here is what that looks like in practice.

T1 in this framework — claims with primary empirical grounding:

FM-3: sycophancy as a retrieval failure (Batista & Griffiths, 2026; Sharma et al., 2024)
FM-6 Mechanism A: training-pattern dominance (Soffer et al., 2025)
FM-6 Mechanism B: unfaithful CoT (Turpin et al., 2023; Arcuschin et al., 2026)
FM-8: pragmatic distortion as a named deception mechanism and the most under-benchmarked failure mode in coverage (Sharma et al., 2024; Shi et al., 2026)
FM-9: session-level drift with domain amplification and expansion mechanisms (Kim et al., 2026)

T2 — most of the intervention design logic. The GIs are derivable from cited sources through reasonable extrapolation but are not stated as grounding interventions in any single source. The TRIZ-derived GIs (GI-6, GI-7) have clear intellectual lineage; their application as AI research grounding interventions is a T2 inference.

T3 — practitioner observations from the validation process itself: the Synthesis Amplification Tendency (AI synthesis consistently overstates the strength and universality of findings); the Blog-as-Primary-Source pattern (commercial blogs reframe primary findings with novel terminology, which then gets cited forward as if it were the primary source’s vocabulary).

T4 — Novel framing [provisional] — claims not derivable from cited sources even through reasonable extrapolation, requiring human review before being treated as established:

The T1–T4 tier system itself [T4 — provisional] — adjacent prior art (GRADE frameworks, uncertainty quantification conventions) informs but does not determine the tier structure
GI-14 (Context-Delta Marking) as a defined intervention [T4 — provisional] — mechanism is grounded in dual-process theory and prompt engineering literature; application as a pre-generation grounding constraint is novel
GI-15 (Scope Enumeration) as a defined intervention [T4 — provisional] — mechanism is grounded in PRISMA methodology; application as a real-time AI generation constraint is novel

This framework is a working document, not a finished theory. The T4 claims are identified and the validation path is clear. The framework improves as those claims are tested against use.
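For readers who want to carry the tiers into their own notes or tooling, a minimal sketch of a tagged claim follows. The class names, the label format, and the rule that only T4 is gated on human review are assumptions drawn from the tier descriptions above, not a published schema.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    T1 = "primary empirical grounding"
    T2 = "reasonable extrapolation from cited sources"
    T3 = "practitioner observation"
    T4 = "novel framing, provisional"


@dataclass
class TaggedClaim:
    text: str
    tier: Tier
    human_reviewed: bool = False

    def label(self) -> str:
        """Claim text with its epistemic status made explicit."""
        return f"{self.text} [{self.tier.name}: {self.tier.value}]"

    def treat_as_established(self) -> bool:
        # Assumption: only T4 is gated on human review, mirroring the T4 definition.
        return self.tier is not Tier.T4 or self.human_reviewed


claim = TaggedClaim("GI-15 (Scope Enumeration) as a defined intervention", Tier.T4)
print(claim.label())
print(claim.treat_as_established())  # False until a human has reviewed it
```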

VII. Using the Framework

Entry Points

The framework is designed to be navigated from three directions.

By failure mode: When a researcher identifies that something has gone wrong, they locate the FM, look up its primary and secondary GIs, and apply the appropriate intervention. The matrix is the routing table.

By process stage: When a researcher is building a research session from scratch, they apply the appropriate GIs at each stage — pre-generation constraints before prompting, at-generation interventions during synthesis, post-generation checks on output.

By GI: When a researcher has built a particular intervention into regular practice — running adversarial probing on every synthesis output, for instance — the matrix shows which failure modes that habit covers and, importantly, which it does not. This entry point is most useful for understanding the gaps in an existing practice.
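The third entry point lends itself to a small gap check. The sketch below transcribes the primary failure modes per intervention from the Coverage per Intervention table and reports which failure modes a given set of habitual GIs leaves without primary coverage. The function name is hypothetical, and secondary coverage is not modeled, so the result overstates the gaps somewhat.

```python
# Primary FMs per intervention, transcribed from the Coverage per Intervention table.
PRIMARY_FMS = {
    "GI-1": {"FM-1"}, "GI-2": {"FM-5"}, "GI-3": {"FM-5"}, "GI-14": {"FM-6"},
    "GI-4": {"FM-1"}, "GI-5": {"FM-3"}, "GI-6": {"FM-3", "FM-8"},
    "GI-7": {"FM-8", "FM-9"}, "GI-8": {"FM-1"}, "GI-15": {"FM-7"},
    "GI-9": {"FM-4"}, "GI-10": {"FM-4", "FM-2"}, "GI-11": {"FM-8", "FM-6"},
    "GI-12": {"FM-8", "FM-3"}, "GI-13": {"FM-9"},
}
ALL_FMS = {f"FM-{n}" for n in range(1, 10)}


def primary_gaps(habitual_gis: set[str]) -> list[str]:
    """Failure modes with no primary coverage from the GIs already in practice."""
    covered = set().union(*(PRIMARY_FMS[gi] for gi in habitual_gis))
    return sorted(ALL_FMS - covered)


# A researcher who runs adversarial probing on every synthesis output:
print(primary_gaps({"GI-11"}))
# ['FM-1', 'FM-2', 'FM-3', 'FM-4', 'FM-5', 'FM-7', 'FM-9']
```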

Session discipline is what makes the framework compound. GI-13 (Session Reset) is the structural discipline that enables all others. Periodic resets at natural break points prevent drift from accumulating before it becomes visible. The re-grounding summary written to re-enter a session is itself a grounding act subject to GI-1 (source anchoring) and GI-3 (epistemic tagging). A session that never resets is a session accumulating unchecked drift. The framework’s own development was conducted with explicit session resets and re-entry summaries.

Deployment Spectrum

The framework operates at three levels of implementation maturity, corresponding to layers in the harness engineering vocabulary established in [link to 002].

Level 1 — Researcher practice. GIs administered manually or semi-manually during a research session. The framework as a discipline, not a system. Entry point for any researcher regardless of technical infrastructure. The intervention list and matrix are sufficient for this level.

Level 2 — Agent skills. Pre-generation and at-generation GIs implemented as task harness profile specifications. GI-5 (Mandatory Disconfirmation Search) as a skill that fires on any claim-generation task. GI-15 (Scope Enumeration) as an output template requirement. GI-6 and GI-7 as reasoning constraint specifications baked into the task harness profile. Post-generation GIs (GI-11, GI-12) implementable as dedicated verifier agents in a multi-agent research pipeline.

Level 3 — Runtime harness layer. Pre-generation GIs (GI-1, GI-2, GI-3, GI-14) implemented at the infrastructure layer, transparent to the calling agent. The harness intercepts research-task API calls, applies generation constraints, and returns grounded responses without agent modification. The agent calls a model; the harness shapes what gets produced.

GI-13 (Session Reset) is the one intervention that stays at researcher practice level regardless of deployment. It requires awareness of session state and a conscious re-grounding decision that cannot be fully automated without reintroducing the framing it is meant to clear.

The deployment levels are additive, not exclusive: Level 1 is the baseline; Level 2 extends it into agent-skill infrastructure; Level 3 extends it into the inference boundary.
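The additive relationship can be sketched as a lookup from a GI and a deployment level to whoever administers that GI. The handler names and the function are illustrative assumptions. In particular, the sketch assumes that GIs not named at Level 2 or Level 3 remain researcher practice, and that pre-generation GIs stay manual until Level 3 is in place.

```python
RESEARCHER = "researcher practice"
AGENT_SKILL = "agent skill"
VERIFIER = "verifier agent"
HARNESS = "runtime harness"

SKILL_GIS = {"GI-5", "GI-6", "GI-7", "GI-15"}    # task harness profiles (Level 2)
VERIFIER_GIS = {"GI-11", "GI-12"}                # dedicated verifier agents (Level 2)
HARNESS_GIS = {"GI-1", "GI-2", "GI-3", "GI-14"}  # infrastructure layer (Level 3)


def handled_by(gi: str, level: int) -> str:
    """Who administers a GI at a given deployment level (1, 2, or 3)."""
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2, or 3")
    if gi == "GI-13":
        return RESEARCHER  # Session Reset stays manual at every level
    if level >= 3 and gi in HARNESS_GIS:
        return HARNESS
    if level >= 2 and gi in SKILL_GIS:
        return AGENT_SKILL
    if level >= 2 and gi in VERIFIER_GIS:
        return VERIFIER
    return RESEARCHER  # Level 1 is the baseline for everything else


print(handled_by("GI-1", 2))   # researcher practice
print(handled_by("GI-1", 3))   # runtime harness
print(handled_by("GI-13", 3))  # researcher practice
```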

Research Grounding Framework: Deployment Spectrum (3 levels, additive)

Level × Stage

Stages: Pre-Generation (before prompting) | At-Generation (during synthesis) | Post-Generation (on output)

Level 1, Researcher Practice: GIs administered manually during a research session. The framework as a discipline, not a system.
Pre-Generation: GI-1, GI-2, GI-3, GI-14 (researcher applies before prompting)
At-Generation: GI-4, GI-5, GI-6, GI-7, GI-8, GI-15 (researcher applies during synthesis)
Post-Generation: GI-9, GI-10, GI-11, GI-12 (researcher applies to output)

Level 2, Agent Skills: Pre- and at-generation GIs implemented as task harness profile specifications. Post-generation GIs as verifier agents.
Pre-Generation: handled at L3 in mature deployment
At-Generation: GI-5, GI-6, GI-7, GI-15 (baked into task harness profiles)
Post-Generation: GI-11, GI-12 (dedicated verifier agents)

Level 3, Runtime Harness Layer: Pre-generation GIs implemented at the infrastructure layer, transparent to the calling agent.
Pre-Generation: GI-1, GI-2, GI-3, GI-14 (transparent to calling agent)
Other stages: agent-level; not harness-addressable

GI-13 (Session-Level): Session Reset, researcher practice only. Stays at Level 1 regardless of deployment maturity. Requires awareness of session state and a conscious re-grounding decision; cannot be automated without reintroducing the framing it is meant to clear.

Forward Reference

The agent skill implementations are being developed as part of the Franklin harness engineering project [link when published]. The runtime harness layer implementation — transparent middleware intercepting research-task API calls — is being investigated as part of the PYGMY infrastructure harness project [link when published]. The research grounding framework publishes the specifications; the harness engineering projects publish the implementations. They are designed to be read together.

VIII. Open Questions and Next Versions

Several things remain unresolved, and naming them is part of showing work.

FM-6 Mechanism B pre-generation gap. Whether a prompt-level intervention exists for unfaithful CoT, or whether this requires architectural or training-level solutions outside the framework’s scope. The representation-level evidence from Arcuschin et al. — biases encoded before reasoning begins — suggests the gap may be permanent at the prompt level.

Pre-generation framing constraint. A candidate GI requiring explicit enumeration of what the query is not asking, analogous to what GI-1 does for sources but applied to the frame. This would provide the first pre-generation coverage for FM-8 and FM-9.

FM-7 primary source gap. Omission as a named failure mode has weaker primary empirical grounding than the other eight. Shi et al. (2026) provide structural support; a dedicated empirical study of systematic omission in AI-assisted research would upgrade the coverage.

Agentic failure modes. Memory poisoning, adversarial prompt injection, and cross-agent trust escalation have no clear QRP analog and no current FM designation. This framework’s scope is single-interaction AI-assisted research (a single agent or chat application); multi-agent research pipelines require separate treatment. The asymmetry between the QRP-mapped failure modes and the novel agentic failure modes is real and should be named in any extension.

Empirical validation. The matrix coverage designations are derived from first principles and source analysis, not from systematic testing against a corpus of failure cases. A validation study measuring intervention effectiveness against documented failures would upgrade the T2 coverage claims to T1.

Companion Pieces

Interactive FM × GI Matrix — an interactive version of the full matrix with entry points by FM, GI, and process stage. Designed to match the Authority × Autonomy Matrix artifact from the harness engineering framework.

Sources and Attribution Appendix — full intellectual lineage of every FM and GI claim with T1/T2/T3/T4 tagging. Demonstrates GI-3 in practice on the framework itself.

Source Validation Log — the working validation document, published as a companion to demonstrate the grounding process. Includes documented errors, corrections, and outstanding validation items.

Michael Bilka, PhD is the founder of The Scurry Lab, a human-AI teaming lab building in public. The lab’s thesis: that intentional, bounded, demonstrably positive human-AI collaboration is an engineering problem, not a philosophical one.

Works Cited

Altshuller, G. (1996). And Suddenly the Inventor Appeared: TRIZ, the Creative Problem Solving Process. Technical Innovation Center.

Arcuschin, I., Janiak, J., Krzyzanowski, R., Rajamanoharan, S., Nanda, N., & Conmy, A. (2026). Chain-of-thought reasoning in the wild is not always faithful. Proceedings of the 43rd International Conference on Machine Learning (ICML 2026). arXiv:2503.08679.

Batista, R.M., & Griffiths, T.L. (2026). A rational analysis of the effects of sycophantic AI. arXiv:2602.14270. DOI: 10.48550/arXiv.2602.14270. Preprint.

Bianchini, S., Di Girolamo, V., Ravet, J., & Arranz, D. (2026). AI in science: When and where it makes a difference. Research Policy. DOI: 10.1016/j.respol.2026.105230. [Author verification: confirmed. DOI: inferred from ScienceDirect URL — verify before final publication.]

Fernandes, D., Villa, S., Nicholls, S., Haavisto, O., Buschek, D., Schmidt, A., Kosch, T., Shen, C., & Welsch, R. (2025). AI makes you smarter but none the wiser: The disconnect between performance and metacognition. Computers in Human Behavior, 175, 108779. DOI: 10.1016/j.chb.2025.108779.

GPTZero. (2026). GPTZero analysis of ICLR 2026 early submissions: Hallucinated citation rates. GPTZero. [Company analysis report — not peer-reviewed. Cite as: GPTZero, 2026.]

Kim, J., Holbrook, E., Huron, J., & Parsons, J. (2026). Beyond AI psychosis and sycophancy: Structural drift as a system-level safety failure. medRxiv. DOI: 10.64898/2026.03.19.26346371. Preprint. [Author names require verification before final publication.]

Kosch, T., & Feger, S. (2026). Prompt-hacking: The new p-hacking? Communications of the ACM, 69(3), 35–37. DOI: 10.1145/3744911. [Opinion piece — cite as argument, not empirical finding.]

Malliaraki, E., & Berditchevskaia, A. (2023). Combining collective and machine intelligence at the knowledge frontier. In OECD, Artificial Intelligence in Science: Challenges, Opportunities and the Future of Research. OECD Publishing.

Sun, M., Choi, S., & Yin, Y. (2026). AI predictions and the expansion of scientific frontiers: Evidence from structural biology. bioRxiv. DOI: 10.64898/2026.04.06.716821. Preprint.

Rainford, P.F., Occhipinti, A., Wang, B., et al. (2026). Knowledge preservation in the era of big science and AI: Strategies for sustainable scientific research. Nature Communications, 17, 4069. DOI: 10.1038/s41467-026-72667-3.

Pangram Labs. (2026). Pangram Labs analysis of ICLR 2026 peer reviews: AI-generated review rates. Pangram Labs. [Company analysis report — not peer-reviewed. Cite as: Pangram Labs, 2026.]

Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S.R., Durmus, E., Hatfield-Dodds, Z., Johnston, S.R., Kravec, S.M., Maxwell, T., McCandlish, S., Ndousse, K., Rausch, O., Schiefer, N., Yan, D., Zhang, M., & Perez, E. (2024). Towards understanding sycophancy in language models. International Conference on Learning Representations (ICLR 2024). arXiv:2310.13548.

Shi, J., Zhang, T.J., Jin, Z., & Conitzer, V. (2026). From hallucination to scheming: A unified taxonomy and benchmark analysis for LLM deception. arXiv:2604.04788.

Soffer, S., Sorin, V., Nadkarni, G.N., & Klang, E. (2025). Pitfalls of large language models in medical ethics reasoning. npj Digital Medicine, 8, 461. DOI: 10.1038/s41746-025-01792-y.

Turpin, M., Michael, J., Perez, E., & Bowman, S. (2023). Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36, 74952–74965. (NeurIPS 2023).

van den Akker, O.R., van Assen, M.A.L.M., Bakker, M., Elsherif, M., Wong, T.K., & Wicherts, J.M. (2024). Preregistration in practice: A comparison of preregistered and non-preregistered studies in psychology. Behavior Research Methods, 56, 5424–5433. DOI: 10.3758/s13428-023-02277-0.