Framework v0.2

Source Validation Log

Working validation document — errors, corrections, and outstanding items.

Validation Summary

| # | Source | Status | Error Type | Severity |
|---|---|---|---|---|
| 1 | van den Akker et al. 2024 | Confirmed | Author attribution error | Medium |
| 2 | Fernandes et al. 2025/2026 | Confirmed | Institutional label used as author shorthand; missing caveats | Low |
| 3 | ICLR 20% hallucination figure | Confirmed with correction | Attribution drift: two distinct findings conflated | High |
| 4 | Sharma et al. ICLR 2024 | Confirmed | Synthesis overstates universality; venue misattributed | Low-Medium |
| 5 | Shapira, Benade & Procaccia 2026 | Confirmed | Synthesis overstates universality of amplification | Low-Medium |
| 6 | “Mount Sinai reasoning-action disconnect” | Partial (source identified) | Terminology introduced by secondary blog; mechanism misdescribed | High |
| 7 | Kim et al. medRxiv 2026 | Confirmed (preprint) | Preprint caveat required; domain expansion mechanism not in FM-9 definition | Medium |
| 8 | Batista & Griffiths arXiv 2602.14270 | Confirmed (preprint) | Preprint caveat required; 5x figure verified as 29.5% vs 5.9% | |
| 9 | Turpin et al. NeurIPS 2023 | Confirmed | Provisional flag removed; venue corrected to NeurIPS 2023 | |
| 10 | Kosch & Feger CACM 2026 | Confirmed | Opinion piece: cite as argument, not empirical finding; authors confirmed | |
| 11 | latentscholar.org (~40% protocol deviation figure) | Confirmed AI-generated synthetic content | Do not use any statistics from this source | High |
| 12 | Arcuschin et al. ICML 2026 | Confirmed (new entry) | Extends FM-6 Mechanism B to non-adversarial settings | |

Entry 1: van den Akker et al. 2024

Synthesis citation: “Bogaert et al. 2024” — PMC11335781

Correct full reference: van den Akker, O.R., van Assen, M.A.L.M., Bakker, M., Elsherif, M., Wong, T.K., & Wicherts, J.M. (2024). Preregistration in practice: A comparison of preregistered and non-preregistered studies in psychology. Behavior Research Methods, 56, 5424–5433. DOI: 10.3758/s13428-023-02277-0

Source exists: Confirmed. Correct journal, correct year, correct DOI.

Author attribution: INCORRECT. No author named “Bogaert” is present anywhere in the paper. Lead author is Olmo R. van den Akker (Tilburg University). All documents citing “Bogaert et al.” must be corrected to van den Akker et al.

Findings accuracy: Core claims accurately represented. The paper confirms:

  • No significant difference in positive result rates between preregistered (0.69) and non-preregistered (0.68) studies (p=.96)
  • No significant difference in effect sizes
  • No significant difference in statistical inconsistencies
  • Preregistered studies did show more power analyses and larger sample sizes

Missing caveat: The comparison group (non-preregistered) showed a lower positive result rate (0.68) than most prior literature estimates (0.92–0.96). The authors attribute this to their more inclusive coding method. This means the null result on Hypothesis 1 may partly reflect a methodological choice rather than true equivalence.

Action required: Correct author name in all documents. Add methodological caveat where the finding is cited as primary evidence against pre-registration efficacy.

Propagation status: OUTSTANDING — “Bogaert et al.” attribution has not yet been corrected across all corpus documents.
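
A minimal sketch of how the outstanding propagation could be audited, assuming the corpus documents are available as plain text. The function name `find_stale_citations` and the sample document names are hypothetical; only the stale string “Bogaert et al.” comes from this entry.

```python
import re


def find_stale_citations(
    documents: dict[str, str], stale: str = "Bogaert et al."
) -> list[tuple[str, int, str]]:
    """Return (document name, line number, line text) for each line citing the stale author."""
    # Tolerate variable whitespace between the words of the citation.
    pattern = re.compile(r"\s+".join(re.escape(part) for part in stale.split()))
    hits = []
    for name, text in sorted(documents.items()):
        for lineno, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((name, lineno, line.strip()))
    return hits


# Hypothetical in-memory corpus; in practice the texts would be read from disk.
corpus = {
    "synthesis.txt": "Preregistration shows no effect (Bogaert et al. 2024).\nSee also Entry 2.",
    "appendix.txt": "van den Akker et al. (2024) report a null result.",
}
stale_hits = find_stale_citations(corpus)
for name, lineno, line in stale_hits:
    print(f"{name}:{lineno}: {line}")
```

The scan works line by line, so a citation wrapped across two lines would be missed. An empty result would be one reasonable condition for closing this item.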


Entry 2: Fernandes et al. 2025/2026

Synthesis citation: “Aalto University study (Fernandes et al., Computers in Human Behavior, October 2025)”

Correct full reference: Fernandes, D., Villa, S., Nicholls, S., Haavisto, O., Buschek, D., Schmidt, A., Kosch, T., Shen, C., & Welsch, R. (2025). AI makes you smarter but none the wiser: The disconnect between performance and metacognition. Computers in Human Behavior, 175, 108779. DOI: 10.1016/j.chb.2025.108779. Available online 9 October 2025.

Source exists: Confirmed. Correct journal, correct DOI, correct publication timeline.

Attribution note: “Aalto” refers to Aalto University, the institutional affiliation of lead author Daniela Fernandes. It is not an author name.

Findings accuracy: Core claims accurately represented. Missing caveats:

  1. AI literacy reversal finding: effect sizes are small; self-reported AI literacy may conflate technical familiarity with genuine competence.
  2. Study 1 is quasi-experimental. Authors describe it as providing “suggestive, not causal, evidence.”

Additional finding not in synthesis: Monetary incentives for accurate metacognitive judgments did not improve metacognitive accuracy in Study 2. Should be added where this source is cited.

Action required: Note quasi-experimental limitation for Study 1. Qualify AI literacy reversal finding. Add monetary incentive null result.


Entry 3: ICLR 20% Hallucination Figure

Error type: ATTRIBUTION DRIFT — two distinct findings conflated.

What the sources actually show:

  • Finding A (GPTZero): ~17% of 300 early ICLR 2026 submissions contained at least one hallucinated citation — a sample, not the full submission pool
  • Finding B (Pangram Labs): ~21% of peer reviews submitted for ICLR 2026 were fully AI-generated — reviewer behavior, not author behavior

Action required: Remove the 20% composite figure from all corpus documents. Replace with two accurately cited separate claims.


Entry 4: Sharma et al. ICLR 2024

Correct full reference: Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S.R., et al. (2024). Towards Understanding Sycophancy in Language Models. ICLR 2024. arXiv:2310.13548.

Source exists: Confirmed. Published at ICLR 2024.

Synthesis overstates universality: The paper tested Claude 1.3, Claude 2, GPT-3.5, GPT-4, and LLaMA 2. It should be cited as evidence for the RLHF mechanism across 2023-era models, not as proof of universality across all current frontier systems.

Action required: Correct venue to ICLR 2024. Add model generation caveat.


Entry 5: Shapira, Benade & Procaccia 2026

Synthesis overstates universality: The study used open-source reward models (DeBERTa-v3, OpenLLaMA-3B, Beaver-7B), not frontier deployed systems.

Action required: Add model-class caveat wherever cited.


Entry 6: “Mount Sinai reasoning-action disconnect”

Primary source identified: Soffer, S., Sorin, V., Nadkarni, G.N., & Klang, E. (2025). Pitfalls of large language models in medical ethics reasoning. npj Digital Medicine, 8, 461. DOI: 10.1038/s41746-025-01792-y.

Attribution chain: Soffer et al. → Mount Sinai press release → MindStudio commercial blog → synthesis document

Errors identified (high severity):

  1. “Reasoning-action disconnect” terminology introduced by MindStudio blog, not present in the paper.
  2. Mechanism misdescribed. The paper documents pattern-matching override (training-pattern dominance), not CoT-output mismatch.

What the paper actually shows:

  • LLMs revert to familiar training-data solutions when presented with modified versions of well-known puzzles and ethics scenarios
  • Error rates: lateral thinking 58–92%, medical ethics 76–96%
  • Pattern persists in ChatGPT-o1, o3, Gemini-2.5
  • Framed through dual-process theory (System 1 vs System 2)

Note on Mechanism B: CoT-output mismatch is a distinct phenomenon documented in Turpin et al. 2023 (Entry 9, confirmed) and Arcuschin et al. 2026 (Entry 12, new). The two mechanisms have been separated in the FM-6 definition.

Action required: None outstanding. All corrections were completed in the FM-6 definition update.


Entry 7: Kim et al. medRxiv 2026

Full reference: Kim, J.E., Holbrook, E., Hron, J.D., & Parsons, C.R. (2026). Beyond AI Psychosis and Sycophancy: Structural Drift as a System-Level Safety Failure. medRxiv preprint. DOI: 10.64898/2026.03.19.26346371. Posted March 19, 2026.

Source exists: Confirmed. DOI verified, authors and institutional affiliations verifiable (Boston Children’s Hospital, Harvard Medical School, MGH/BWH). Paper read in full.

Preprint status: Not peer-reviewed. Citable with standard preprint qualification.

Findings confirmed:

  • Domain amplification in four domains: Atmosphere (d=0.46), Ipseity (d=0.31), Intersubjectivity (d=0.33), Temporality (d=0.14)
  • Domain expansion in 83.8% of dialogues (88/105), mean 0.675 new domains per exchange
  • Divergence begins within first 10% of normalized dialogue time
  • Negative controls confirm expansion is not generic conversational elaboration
  • Models tested: GPT-5.2, Gemini-2.5-Flash, Claude Sonnet 4.5
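
The reported domain-expansion percentage can be re-derived from the raw counts given above. This is a small sketch of that check; the helper name `matches_reported_percentage` is hypothetical.

```python
def matches_reported_percentage(count: int, total: int, reported: float, decimals: int = 1) -> bool:
    """True if count/total, formatted as a percentage, equals the reported value."""
    computed = 100 * count / total
    # Compare formatted strings to avoid floating-point equality pitfalls.
    return f"{computed:.{decimals}f}" == f"{reported:.{decimals}f}"


# Counts and percentage as reported in this entry: 88 of 105 dialogues, 83.8%.
expansion_ok = matches_reported_percentage(88, 105, 83.8)
print(expansion_ok)
```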

Additional contribution noted in Sources and Attribution appendix: Kim et al. provides the clearest articulation of why FM-9 is distinct from FM-8 — drift does not require agreement; it operates through meaning-expansion regardless of stance.


Entry 8: Batista & Griffiths arXiv 2602.14270

Full reference: Batista, R.M. & Griffiths, T.L. (2026). A Rational Analysis of the Effects of Sycophantic AI. arXiv:2602.14270v1. Posted 15 February 2026. Princeton University (School of Public and International Affairs, Department of Computer Science, Department of Psychology).

Source exists: Confirmed. PDF read in full.

Preprint status: arXiv only. Standard preprint caveat required.

Key figures confirmed:

  • N=557 confirmed
  • Five-times figure confirmed with precision: Random Sequence condition 29.5% discovery rate vs. Default GPT 5.9% — approximately 5:1 ratio
  • Rule Confirming (8.4%) and Default GPT (5.9%) confirmed statistically equivalent via TOST equivalence test — not merely directionally similar
  • Model tested: OpenAI GPT-5.1-Chat
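
The “five-times” figure follows directly from the two discovery rates confirmed above. A quick arithmetic check, with illustrative variable names:

```python
# Discovery rates (percent) as confirmed in this entry.
random_sequence_rate = 29.5
default_gpt_rate = 5.9

discovery_ratio = random_sequence_rate / default_gpt_rate
print(f"ratio: {discovery_ratio:.2f}")
```

The ratio describes relative discovery rates only. In absolute terms the gap is 23.6 percentage points.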

Formal mechanism confirmed: Bayesian analysis shows a sycophantic AI sampling from the user’s hypothesis distribution rather than the true distribution produces increasing confidence in an incorrect hypothesis with no progress toward truth. This is the formal epistemic grounding for the sycophancy ≈ HARKing analog mapping.
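
The mechanism can be illustrated with a toy Bayesian update. This is a sketch under stated assumptions, not the model from Batista & Griffiths: the two-rule hypothesis space over the integers 1 to 10, the uniform prior, and the example sequences are all invented for illustration. The learner assumes every example is drawn uniformly from the true rule, while the sycophantic examples are in fact drawn from the learner's own rule.

```python
from fractions import Fraction

# Toy hypothesis space over the integers 1..10 (an assumption for illustration).
HYPOTHESES = {
    "even numbers": {2, 4, 6, 8, 10},   # the user's incorrect, narrower rule
    "any number": set(range(1, 11)),    # the true, broader rule
}


def posterior(examples: list[int], prior: dict[str, Fraction]) -> dict[str, Fraction]:
    """Bayesian update, treating each example as a uniform draw from the rule's members."""
    weights = {}
    for name, members in HYPOTHESES.items():
        likelihood = Fraction(1)
        for x in examples:
            likelihood *= Fraction(1, len(members)) if x in members else Fraction(0)
        weights[name] = prior[name] * likelihood
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


uniform_prior = {name: Fraction(1, 2) for name in HYPOTHESES}

# Sycophantic feedback: every example is consistent with the user's own rule.
sycophantic = posterior([2, 4, 6, 8], uniform_prior)
# Unbiased feedback: examples from the true rule, which can include odd numbers.
unbiased = posterior([2, 7, 4, 9], uniform_prior)

print(sycophantic["even numbers"])  # confidence in the wrong rule after sycophantic examples
print(unbiased["even numbers"])     # confidence in the wrong rule after unbiased examples
```

In this toy, four confirming examples raise confidence in the incorrect rule from 1/2 to 16/17, while a single odd example from the true rule eliminates it.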

Additional note: PARKing (Prompt Adjustments to Reach Known Outcomes) originates in Kosch & Feger (Entry 10), not in this paper. Batista & Griffiths cite Kosch & Feger but do not introduce the term.


Entry 9: Turpin et al. NeurIPS 2023

Full reference: Turpin, M., Michael, J., Perez, E., & Bowman, S.R. (2023). Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. Advances in Neural Information Processing Systems 36 (NeurIPS 2023). arXiv:2305.04388. Affiliations: NYU Alignment Research Group, Cohere, Anthropic.

Source exists: Confirmed. PDF read in full. Published at NeurIPS 2023 — not arXiv-only, not ICLR. Venue was unspecified in prior synthesis documents; now corrected.

Provisional flag: REMOVED. FM-6 Mechanism B is confirmed.

Key findings confirmed:

  • CoT explanations can systematically misrepresent the true reason for a model’s prediction
  • When biased toward incorrect answers, models frequently generate CoT explanations rationalizing those answers
  • Accuracy drops by as much as 36% on 13 BIG-Bench Hard tasks under biasing conditions
  • Models tested: GPT-3.5 and Claude 1.0

Model generation caveat: GPT-3.5 and Claude 1.0 only — 2023-era models. The mechanism is well-evidenced; applicability to current frontier systems should be verified. Arcuschin et al. (Entry 12) provide that verification for non-adversarial settings.

Methodological caveat: At least one published critique argues the headline finding flattens important distinctions and questions reliance on single-label accuracy in ambiguous tasks. Cite the mechanism as established; do not cite as proof of universality.

RLHF mechanism note: The paper explicitly identifies RLHF training as a likely driver — models may be disincentivized from faithful explanations by training objectives that reward plausible-looking outputs. Relevant to the FM-6 Mechanism B architectural gap question: the mechanism may be partially training-origin rather than purely architectural.


Entry 10: Kosch & Feger CACM 2026

Full reference: Kosch, T. & Feger, S. (2026). Prompt-Hacking: The New p-Hacking? Communications of the ACM, 69(3), 35–37. DOI: 10.1145/3744911. March 2026.

Authors confirmed: Thomas Kosch (HU Berlin) and Sebastian Feger (TH Rosenheim). Authors were pending in prior validation log entry.

Genre confirmed: Opinion piece in CACM — not an empirical study. Must be cited as “Kosch & Feger argue…” not “research shows…” This caveat is load-bearing wherever the prompt-hacking ≈ p-hacking analog is cited.

Key concept confirmed: Prompt-hacking ≈ p-hacking parallel is the paper’s central claim. PARKing (Prompt Adjustments to Reach Known Outcomes) introduced here as the LLM analog of HARKing.

Additional finding beyond synthesis: The paper argues that unlike p-hacking (which misuses inherently neutral statistical techniques), prompt-hacking exploits tools not impartial by design — meaning even “correct” use of LLMs for data analysis cannot guarantee validity. This is a stronger claim than mere behavioral equivalence and is relevant to the framework’s positioning of the prompt-hacking analog.


Entry 11: latentscholar.org

Status: CONFIRMED AI-GENERATED SYNTHETIC CONTENT.

Specific figure at risk: ~40% undisclosed protocol deviation rate in preregistered studies. Do not use this figure. The underlying phenomenon is real, but a verified figure would require a separate primary-source search.


Entry 12: Arcuschin et al. ICML 2026 — NEW ENTRY

Full reference: Arcuschin, I., Janiak, J., Krzyzanowski, R., Rajamanoharan, S., Nanda, N., & Conmy, A. (2026). Chain-of-Thought Reasoning in the Wild Is Not Always Faithful. Proceedings of the 43rd International Conference on Machine Learning (ICML 2026). arXiv:2503.08679v5. Affiliations: Poseidon Research, AI Office European Commission, Google DeepMind.

Source exists: Confirmed. PDF read in full. ICML 2026 — peer-reviewed.

Conflict of interest disclosure: Authors Rajamanoharan, Nanda, and Conmy are employed by Google DeepMind, which develops Gemini, one of the evaluated models. Disclosed in the paper.

Why added: Extends the Turpin et al. (Entry 9) FM-6 Mechanism B finding from adversarial biasing conditions to naturally worded, non-adversarial prompts — the condition that matters for a research grounding framework operating on real-world AI outputs.

Key findings:

Implicit Post-Hoc Rationalization (IPHR):

  • Unfaithfulness rates on naturally worded comparative questions range from 0.04% (Claude 3.7 Sonnet with 1k thinking tokens) to 13.49% (GPT-4o-mini) across 15 frontier models
  • Models modify cited facts or switch reasoning approaches to justify systematically biased answers without acknowledging the bias
  • Answer biases are partially encoded in model representations before reasoning begins — confirmed via linear probing experiments
  • Results robust across temperatures, sampling counts, and judge models (cross-validated with Claude Sonnet 4.6)

Unfaithful Illogical Shortcuts:

  • Models use clearly invalid reasoning steps to reach correct answers on hard math problems, without acknowledging the shortcut
  • Thinking models are substantially more faithful than non-thinking counterparts — gap of ~17pp consistent across Anthropic, DeepSeek, and Qwen model pairs
  • Unfaithfulness persists on post-training-cutoff problems, ruling out contamination as primary explanation

Relevance to FM-6 Mechanism B gap question:

  • Representation-level evidence (biases encoded before reasoning begins) is consistent with a partial architectural origin for Mechanism B
  • Thinking model gap (~17pp) suggests partial addressability through inference-time compute, not elimination
  • Together with Turpin et al. (few-shot prompting reduces but does not eliminate unfaithfulness), supports the framework’s T2 synthesis inference that Mechanism B is partially addressable at the prompt/GI level but not fully

Relevance to FM-6 matrix coverage: The two documented phenomena (IPHR and Unfaithful Illogical Shortcuts) remain under FM-6. The illogical shortcuts pattern involves invalid reasoning steps, not synthesis validity failures, so it is a faithfulness failure rather than an FM-4 failure. Matrix coverage is unchanged.


Cross-Cutting Observations

Credentialing Drift Pattern

Three independent findings confirmed the same structural pattern: PRISMA compliance, pre-registration badges, and TRIZ certification degrade via Goodhart’s Law into credentialing rituals decoupled from underlying quality. “Credentialing Drift” — the temporal erosion of signal value of any credential as the field learns to produce the credential without the underlying behavior — is a novel framework contribution, grounded in Goodhart/Campbell literature. Tagged T4 — provisional in the Sources and Attribution appendix pending human review and more systematic prior art search.

Synthesis Amplification Tendency

Across multiple entries, synthesis documents consistently overstated the strength, universality, or causal certainty of findings. This is itself a demonstration of FM-2 operating at scale via sycophantic retrieval. Tagged T3 in the Sources and Attribution appendix — framework authors’ own systematic observation.

Blog-as-Primary-Source Risk

Entry 6 demonstrates the failure mode: commercial blogs reframing primary findings using novel terminology create attribution chains where the terminology gets cited forward. Intervention: always trace citations to primary sources before using terminology introduced in secondary sources.

QRP→AI Analog Mapping (updated 2026-06-03)

Three analogs confirmed and sourced:

  • Sycophancy ≈ HARKing: Batista & Griffiths (2026) — T2 synthesis inference, Bayesian mechanism formally established
  • Prompt-hacking ≈ p-hacking: Kosch & Feger (2026) — T3 opinion piece, cite as argument not empirical finding
  • PARKing ≈ HARKing (researcher-behavior level): Kosch & Feger (2026)

Novel agentic failure modes (memory poisoning, adversarial prompt injection, cross-agent trust escalation) confirmed to have no QRP analog — this asymmetry is real and should be named explicitly in the framework.

Lancet/Columbia 12x Figure

Not formally validated via direct paper access. Any citation must carry three caveats: causal conflation (AI vs. paper mills), Open Access subset bias, and AI verifier precision limitations (91% on a 500-record validation set). The trend direction is real; the specific figure should not be cited without these qualifications.


Outstanding Validation Items (priority order)

  1. van den Akker attribution propagation — correct “Bogaert et al.” in all corpus documents
  2. PMC13105447 — algorithmic sycophancy in biomedical research; direct access not yet performed
  3. Credentialing Drift prior art search — confirm the named phenomenon has no prior literature before the framework claims it as a novel contribution
  4. ICLR 20% figure removal — composite statistic must be replaced with two accurately cited separate claims in all corpus documents

Temporal Validity Observation

(Added 2026-06-02, updated 2026-06-03)

AI capability research has a short shelf life. Model generations tested across the corpus:

  • Sharma et al. (ICLR 2024): Claude 1.3, Claude 2, GPT-3.5, GPT-4, LLaMA 2
  • Fernandes et al.: ChatGPT-4o (gpt-4o-2024-05-13)
  • Soffer et al.: ChatGPT-o1, o3, Gemini-2.5
  • Shapira et al.: open-source reward models only
  • Turpin et al. (NeurIPS 2023): GPT-3.5, Claude 1.0
  • Arcuschin et al. (ICML 2026): 15 frontier models including Claude 3.7 Sonnet, GPT-4o, Gemini 2.5 Pro, DeepSeek R1 — most current and broadest coverage in corpus
  • Batista & Griffiths (2026): GPT-5.1-Chat
  • Kim et al. (2026): GPT-5.2, Gemini-2.5-Flash, Claude Sonnet 4.5 (Dec 2025–Jan 2026)

The failure mode taxonomy is more durable than specific rates. Where rates are cited, they should carry model generation and date of measurement. Rate figures are illustrative, not foundational.
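
One way to enforce this is to make model generation and measurement date mandatory fields of any cited rate. The sketch below assumes a hypothetical `RateCitation` record; the class and field names are illustrative and not part of the framework, while the example values are taken from Entry 7 and the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateCitation:
    """A cited rate together with the context needed to judge its shelf life."""
    source: str
    figure: str
    models: tuple[str, ...]
    measured: str  # date or window of measurement, as reported by the source

    def render(self) -> str:
        """Format the rate so it never appears without its models and date."""
        model_list = ", ".join(self.models)
        return f"{self.figure} ({self.source}; models: {model_list}; measured {self.measured})"


kim = RateCitation(
    source="Kim et al. 2026, medRxiv preprint",
    figure="domain expansion in 83.8% of dialogues",
    models=("GPT-5.2", "Gemini-2.5-Flash", "Claude Sonnet 4.5"),
    measured="Dec 2025 to Jan 2026",
)
citation_text = kim.render()
print(citation_text)
```

Because every field is required, a rate cannot be constructed without its model list and measurement window.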
