Framework v0.2

Source Validation Log

Working validation document — errors, corrections, and outstanding items.

Validation Summary

#	Source	Status	Error Type	Severity
1	van den Akker et al. 2024	Confirmed	Author attribution error	Medium
2	Fernandes et al. 2025/2026	Confirmed	Institutional label used as author shorthand; missing caveats	Low
3	ICLR 20% hallucination figure	Confirmed with correction	Attribution drift — two distinct findings conflated	High
4	Sharma et al. ICLR 2024	Confirmed	Synthesis overstates universality; venue misattributed	Low-Medium
5	Shapira, Benade & Procaccia 2026	Confirmed	Synthesis overstates universality of amplification	Low-Medium
6	“Mount Sinai reasoning-action disconnect”	Partial — source identified	Terminology introduced by secondary blog; mechanism misdescribed	High
7	Kim et al. medRxiv 2026	Confirmed — preprint	Preprint caveat required; domain expansion mechanism not in FM-9 definition	Medium
8	Batista & Griffiths arXiv 2602.14270	Confirmed — preprint	Preprint caveat required; 5x figure verified as 29.5% vs 5.9%	—
9	Turpin et al. NeurIPS 2023	Confirmed	Provisional flag removed; venue corrected to NeurIPS 2023	—
10	Kosch & Feger CACM 2026	Confirmed	Opinion piece — cite as argument, not empirical finding; authors confirmed	—
11	latentscholar.org (~40% protocol deviation figure)	Confirmed AI-generated synthetic content	Do not use any statistics from this source	High
12	Arcuschin et al. ICML 2026	Confirmed — new entry	Extends FM-6 Mechanism B to non-adversarial settings	—

Entry 1: van den Akker et al. 2024

Synthesis citation: “Bogaert et al. 2024” — PMC11335781

Correct full reference: van den Akker, O.R., van Assen, M.A.L.M., Bakker, M., Elsherif, M., Wong, T.K., & Wicherts, J.M. (2024). Preregistration in practice: A comparison of preregistered and non-preregistered studies in psychology. Behavior Research Methods, 56, 5424–5433. DOI: 10.3758/s13428-023-02277-0

Source exists: Confirmed. Correct journal, correct year, correct DOI.

Author attribution: INCORRECT. No author named “Bogaert” is present anywhere in the paper. Lead author is Olmo R. van den Akker (Tilburg University). All documents citing “Bogaert et al.” must be corrected to van den Akker et al.

Findings accuracy: Core claims accurately represented. The paper confirms:

No significant difference in positive result rates between preregistered (0.69) and non-preregistered (0.68) studies (p=.96)
No significant difference in effect sizes
No significant difference in statistical inconsistencies
Preregistered studies did show more power analyses and larger sample sizes

Missing caveat: The comparison group (non-preregistered) showed a lower positive result rate (0.68) than most prior literature estimates (0.92–0.96). The authors attribute this to their more inclusive coding method, so the null result on Hypothesis 1 may partly reflect a methodological choice rather than pure equivalence.

Action required: Correct author name in all documents. Add methodological caveat where the finding is cited as primary evidence against pre-registration efficacy.

Propagation status: OUTSTANDING — “Bogaert et al.” attribution has not yet been corrected across all corpus documents.
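
The outstanding propagation check could be scripted along the following lines. This is a minimal sketch, not an existing tool: the function name `find_stale_attributions`, the file suffixes, and the idea that the corpus lives in one directory tree are all assumptions. It only reports matches, so each correction to van den Akker et al. can still be reviewed by hand.

```python
import re
import tempfile
from pathlib import Path

# Matches "Bogaert et al." with flexible spacing; the trailing period is optional.
STALE_ATTRIBUTION = re.compile(r"Bogaert\s+et\s+al\.?")


def find_stale_attributions(corpus_dir: Path, suffixes=(".md", ".txt")):
    """Return (path, line_number, line_text) for every line still citing Bogaert et al."""
    hits = []
    for path in sorted(corpus_dir.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for line_number, line in enumerate(text.splitlines(), start=1):
            if STALE_ATTRIBUTION.search(line):
                hits.append((path, line_number, line.strip()))
    return hits


def demo():
    """Run the scan on a throwaway two-file corpus (hypothetical file names)."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "synthesis.md").write_text(
            "Intro\nPreregistration showed no effect (Bogaert et al. 2024).\n",
            encoding="utf-8",
        )
        (root / "log.txt").write_text(
            "van den Akker et al. 2024 is the correct citation.\n",
            encoding="utf-8",
        )
        return [(p.name, n) for p, n, _ in find_stale_attributions(root)]
```

Calling `demo()` lists only the line in `synthesis.md`; the correctly attributed file is not flagged.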

Entry 2: Fernandes et al. 2025/2026

Synthesis citation: “Aalto University study (Fernandes et al., Computers in Human Behavior, October 2025)”

Correct full reference: Fernandes, D., Villa, S., Nicholls, S., Haavisto, O., Buschek, D., Schmidt, A., Kosch, T., Shen, C., & Welsch, R. (2025). AI makes you smarter but none the wiser: The disconnect between performance and metacognition. Computers in Human Behavior, 175, 108779. DOI: 10.1016/j.chb.2025.108779. Available online 9 October 2025.

Source exists: Confirmed. Correct journal, correct DOI, correct publication timeline.

Attribution note: “Aalto” refers to Aalto University, the institutional affiliation of lead author Daniela Fernandes. It is not an author name.

Findings accuracy: Core claims accurately represented. Missing caveats:

AI literacy reversal finding: effect sizes are small; self-reported AI literacy may conflate technical familiarity with genuine competence.
Study 1 is quasi-experimental. Authors describe it as providing “suggestive, not causal, evidence.”

Additional finding not in synthesis: Monetary incentives for accurate metacognitive judgments did not improve metacognitive accuracy in Study 2. This null result should be added wherever this source is cited.

Action required: Note quasi-experimental limitation for Study 1. Qualify AI literacy reversal finding. Add monetary incentive null result.

Entry 3: ICLR 20% Hallucination Figure

Error type: ATTRIBUTION DRIFT — two distinct findings conflated.

What the sources actually show:

Finding A (GPTZero): ~17% of 300 early ICLR 2026 submissions contained at least one hallucinated citation — a sample, not the full submission pool
Finding B (Pangram Labs): ~21% of peer reviews submitted for ICLR 2026 were fully AI-generated — reviewer behavior, not author behavior

Action required: Remove the 20% composite figure from all corpus documents. Replace with two accurately cited separate claims.

Entry 4: Sharma et al. ICLR 2024

Correct full reference: Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S.R., et al. (2024). Towards Understanding Sycophancy in Language Models. ICLR 2024. arXiv:2310.13548.

Source exists: Confirmed. Published at ICLR 2024.

Synthesis overstates universality: The paper tested Claude 1.3, Claude 2, GPT-3.5, GPT-4, and LLaMA 2. It should be cited as evidence for the RLHF mechanism across 2023-era models, not as proof of universality across all current frontier systems.

Action required: Correct venue to ICLR 2024. Add model generation caveat.

Entry 5: Shapira, Benade & Procaccia 2026

Synthesis overstates universality: The study used open-source reward models (DeBERTa-v3, OpenLLaMA-3B, Beaver-7B), not frontier deployed systems.

Action required: Add model-class caveat wherever cited.

Entry 6: “Mount Sinai reasoning-action disconnect”

Primary source identified: Soffer, S., Sorin, V., Nadkarni, G.N., & Klang, E. (2025). Pitfalls of large language models in medical ethics reasoning. npj Digital Medicine, 8, 461. DOI: 10.1038/s41746-025-01792-y.

Attribution chain: Soffer et al. → Mount Sinai press release → MindStudio commercial blog → synthesis document

Error type (HIGH SEVERITY):

“Reasoning-action disconnect” terminology introduced by MindStudio blog, not present in the paper.
Mechanism misdescribed. The paper documents pattern-matching override (training-pattern dominance), not CoT-output mismatch.

What the paper actually shows:

LLMs revert to familiar training-data solutions when presented with modified versions of well-known puzzles and ethics scenarios
Error rates: lateral thinking 58–92%, medical ethics 76–96%
Pattern persists in ChatGPT-o1, o3, Gemini-2.5
Framed through dual-process theory (System 1 vs System 2)

Note on Mechanism B: CoT-output mismatch is a distinct phenomenon documented in Turpin et al. 2023 (Entry 9, confirmed) and Arcuschin et al. 2026 (Entry 12, new). The two mechanisms have been separated in the FM-6 definition.

Action required: None outstanding. All corrections were completed in the FM-6 definition update.

Entry 7: Kim et al. medRxiv 2026

Full reference: Kim, J.E., Holbrook, E., Hron, J.D., & Parsons, C.R. (2026). Beyond AI Psychosis and Sycophancy: Structural Drift as a System-Level Safety Failure. medRxiv preprint. DOI: 10.64898/2026.03.19.26346371. Posted March 19, 2026.

Source exists: Confirmed. DOI verified, authors and institutional affiliations verifiable (Boston Children’s Hospital, Harvard Medical School, MGH/BWH). Paper read in full.

Preprint status: Not peer-reviewed. Citable with standard preprint qualification.

Findings confirmed:

Domain amplification in four domains: Atmosphere (d=0.46), Ipseity (d=0.31), Intersubjectivity (d=0.33), Temporality (d=0.14)
Domain expansion in 83.8% of dialogues (88/105), mean 0.675 new domains per exchange
Divergence begins within first 10% of normalized dialogue time
Negative controls confirm expansion is not generic conversational elaboration
Models tested: GPT-5.2, Gemini-2.5-Flash, Claude Sonnet 4.5

Additional contribution noted in Sources and Attribution appendix: Kim et al. provides the clearest articulation of why FM-9 is distinct from FM-8 — drift does not require agreement; it operates through meaning-expansion regardless of stance.

Entry 8: Batista & Griffiths arXiv 2602.14270

Full reference: Batista, R.M. & Griffiths, T.L. (2026). A Rational Analysis of the Effects of Sycophantic AI. arXiv:2602.14270v1. Posted 15 February 2026. Princeton University (School of Public and International Affairs, Department of Computer Science, Department of Psychology).

Source exists: Confirmed. PDF read in full.

Preprint status: arXiv only. Standard preprint caveat required.

Key figures confirmed:

N=557 confirmed
Five-times figure confirmed with precision: Random Sequence condition 29.5% discovery rate vs. Default GPT 5.9% — approximately 5:1 ratio
Rule Confirming (8.4%) and Default GPT (5.9%) confirmed statistically equivalent via TOST equivalence test — not merely directionally similar
Model tested: OpenAI GPT-5.1-Chat
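
The five-times figure can be rechecked from the two reported rates. The snippet below is a small arithmetic sketch; the function name `discovery_ratio` is illustrative and not taken from the paper.

```python
def discovery_ratio(random_sequence_pct: float, default_gpt_pct: float) -> float:
    """Ratio of the two discovery rates reported in the log (both in percent)."""
    return random_sequence_pct / default_gpt_pct


# Rates as recorded above: Random Sequence 29.5%, Default GPT 5.9%.
ratio = discovery_ratio(29.5, 5.9)

# Absolute gap in percentage points, rounded to one decimal place.
gap_pp = round(29.5 - 5.9, 1)
```

The ratio comes out at 5.0 and the absolute gap at 23.6 percentage points. Where the figure is cited, stating both avoids reading "5x" as an absolute difference.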

Formal mechanism confirmed: Bayesian analysis shows a sycophantic AI sampling from the user’s hypothesis distribution rather than the true distribution produces increasing confidence in an incorrect hypothesis with no progress toward truth. This is the formal epistemic grounding for the sycophancy ≈ HARKing analog mapping.
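
The mechanism can be illustrated with a toy rule-discovery model. This is a hedged sketch and not the model from Batista & Griffiths: the two rules, the number range, the 50/50 prior, and the example sequences are all invented for illustration. The user holds a narrow, incorrect hypothesis and updates as if every example were drawn uniformly from the true rule.

```python
from fractions import Fraction

# Hypothetical rules over the integers 1..20 (assumptions for illustration only).
NARROW = frozenset(range(4, 21, 4))     # user's incorrect hypothesis: multiples of 4
TRUE_RULE = frozenset(range(2, 21, 2))  # actual rule: even numbers
HYPOTHESES = {"narrow": NARROW, "true": TRUE_RULE}


def update(posterior, example):
    """One Bayesian update, assuming examples are drawn uniformly from the rule."""
    unnormalized = {
        name: posterior[name] * (Fraction(1, len(members)) if example in members else 0)
        for name, members in HYPOTHESES.items()
    }
    total = sum(unnormalized.values())
    return {name: weight / total for name, weight in unnormalized.items()}


def belief_in_narrow(examples):
    """Track the user's belief in the narrow hypothesis after each example."""
    posterior = {"narrow": Fraction(1, 2), "true": Fraction(1, 2)}
    trajectory = []
    for example in examples:
        posterior = update(posterior, example)
        trajectory.append(posterior["narrow"])
    return trajectory


# A sycophantic assistant only offers examples that fit the user's hypothesis.
sycophantic = belief_in_narrow([4, 8, 12, 16])

# Examples that come from the true rule eventually include a counterexample (6).
from_true_rule = belief_in_narrow([4, 6, 12, 18])
```

In this toy setup, belief in the incorrect hypothesis climbs 2/3, 4/5, 8/9, 16/17 under sycophantic examples, because each confirming example fits the narrower rule better. A single counterexample from the true rule drives it to zero. The sycophantic sequence never supplies that counterexample, which is the sense in which confidence rises with no progress toward truth.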

Additional note: PARKing (Prompt Adjustments to Reach Known Outcomes) originates in Kosch & Feger (Entry 10), not in this paper. Batista & Griffiths cite Kosch & Feger but do not introduce the term.

Entry 9: Turpin et al. NeurIPS 2023

Full reference: Turpin, M., Michael, J., Perez, E., & Bowman, S.R. (2023). Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. Advances in Neural Information Processing Systems 36 (NeurIPS 2023). arXiv:2305.04388. Affiliations: NYU Alignment Research Group, Cohere, Anthropic.

Source exists: Confirmed. PDF read in full. Published at NeurIPS 2023 — not arXiv-only, not ICLR. Venue was unspecified in prior synthesis documents; now corrected.

Provisional flag: REMOVED. FM-6 Mechanism B is confirmed.

Key findings confirmed:

CoT explanations can systematically misrepresent the true reason for a model’s prediction
When biased toward incorrect answers, models frequently generate CoT explanations rationalizing those answers
Accuracy drops by as much as 36% on 13 BIG-Bench Hard tasks under biasing conditions
Models tested: GPT-3.5 and Claude 1.0

Model generation caveat: GPT-3.5 and Claude 1.0 only — 2023-era models. The mechanism is well-evidenced; applicability to current frontier systems should be verified. Arcuschin et al. (Entry 12) provide that verification for non-adversarial settings.

Methodological caveat: At least one published critique argues the headline finding flattens important distinctions and questions reliance on single-label accuracy in ambiguous tasks. Cite the mechanism as established; do not cite as proof of universality.

RLHF mechanism note: The paper explicitly identifies RLHF training as a likely driver — models may be disincentivized from faithful explanations by training objectives that reward plausible-looking outputs. Relevant to the FM-6 Mechanism B architectural gap question: the mechanism may be partially training-origin rather than purely architectural.

Entry 10: Kosch & Feger CACM 2026

Full reference: Kosch, T. & Feger, S. (2026). Prompt-Hacking: The New p-Hacking? Communications of the ACM, 69(3), 35–37. DOI: 10.1145/3744911. March 2026.

Authors confirmed: Thomas Kosch (HU Berlin) and Sebastian Feger (TH Rosenheim). Authors were pending in prior validation log entry.

Genre confirmed: Opinion piece in CACM — not an empirical study. Must be cited as “Kosch & Feger argue…” not “research shows…” This caveat is load-bearing wherever the prompt-hacking ≈ p-hacking analog is cited.
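
One way to enforce this caveat is a simple phrasing check over draft sentences. The sketch below is illustrative only: the function name and the phrase list are assumptions, and only "research shows" comes from this log.

```python
# Phrases that present a claim as an empirical finding.
# Only "research shows" appears in this log; the others are assumed variants.
EMPIRICAL_PHRASES = ("research shows", "studies show", "evidence shows")

# Sources this log classifies as opinion pieces.
OPINION_SOURCES = ("Kosch & Feger",)


def flags_opinion_as_empirical(sentence: str) -> bool:
    """True if a sentence cites an opinion source using empirical-finding phrasing."""
    lowered = sentence.lower()
    cites_opinion = any(source.lower() in lowered for source in OPINION_SOURCES)
    sounds_empirical = any(phrase in lowered for phrase in EMPIRICAL_PHRASES)
    return cites_opinion and sounds_empirical


bad = "Research shows prompt-hacking mirrors p-hacking (Kosch & Feger 2026)."
good = "Kosch & Feger argue that prompt-hacking mirrors p-hacking."
```

Here `flags_opinion_as_empirical(bad)` is true and `flags_opinion_as_empirical(good)` is false. A substring check like this will miss paraphrases, so it complements manual review and does not replace it.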

Key concept confirmed: Prompt-hacking ≈ p-hacking parallel is the paper’s central claim. PARKing (Prompt Adjustments to Reach Known Outcomes) introduced here as the LLM analog of HARKing.

Additional finding beyond synthesis: The paper argues that unlike p-hacking (which misuses inherently neutral statistical techniques), prompt-hacking exploits tools not impartial by design — meaning even “correct” use of LLMs for data analysis cannot guarantee validity. This is a stronger claim than mere behavioral equivalence and is relevant to the framework’s positioning of the prompt-hacking analog.

Entry 11: latentscholar.org

Status: CONFIRMED AI-GENERATED SYNTHETIC CONTENT.

Specific figure at risk: ~40% undisclosed protocol deviation rate in preregistered studies. Do not use. Underlying phenomenon is real; a verified figure requires a separate primary source search if needed.

Entry 12: Arcuschin et al. ICML 2026 — NEW ENTRY

Full reference: Arcuschin, I., Janiak, J., Krzyzanowski, R., Rajamanoharan, S., Nanda, N., & Conmy, A. (2026). Chain-of-Thought Reasoning in the Wild Is Not Always Faithful. Proceedings of the 43rd International Conference on Machine Learning (ICML 2026). arXiv:2503.08679v5. Affiliations: Poseidon Research, AI Office European Commission, Google DeepMind.

Source exists: Confirmed. PDF read in full. ICML 2026 — peer-reviewed.

Conflict of interest disclosure: Authors Rajamanoharan, Nanda, and Conmy are employed by Google DeepMind, which develops Gemini, one of the evaluated models. Disclosed in the paper.

Why added: Extends the Turpin et al. (Entry 9) FM-6 Mechanism B finding from adversarial biasing conditions to naturally worded, non-adversarial prompts — the condition that matters for a research grounding framework operating on real-world AI outputs.

Key findings:

Implicit Post-Hoc Rationalization (IPHR):

Unfaithfulness rates on naturally worded comparative questions range from 0.04% (Claude 3.7 Sonnet with 1k thinking tokens) to 13.49% (GPT-4o-mini) across 15 frontier models
Models modify cited facts or switch reasoning approaches to justify systematically biased answers without acknowledging the bias
Answer biases are partially encoded in model representations before reasoning begins — confirmed via linear probing experiments
Results robust across temperatures, sampling counts, and judge models (cross-validated with Claude Sonnet 4.6)

Unfaithful Illogical Shortcuts:

Models use clearly invalid reasoning steps to reach correct answers on hard math problems, without acknowledging the shortcut
Thinking models are substantially more faithful than non-thinking counterparts — gap of ~17pp consistent across Anthropic, DeepSeek, and Qwen model pairs
Unfaithfulness persists on post-training-cutoff problems, ruling out contamination as primary explanation

Relevance to FM-6 Mechanism B gap question:

Representation-level evidence (biases encoded before reasoning begins) is consistent with a partial architectural origin for Mechanism B
Thinking model gap (~17pp) suggests partial addressability through inference-time compute, not elimination
Together with Turpin et al. (few-shot prompting reduces but does not eliminate unfaithfulness), supports the framework’s T2 synthesis inference that Mechanism B is partially, but not fully, addressable at the prompt/GI level

Relevance to FM-6 matrix coverage: The two documented phenomena (IPHR and Unfaithful Illogical Shortcuts) remain under FM-6. The illogical shortcuts pattern involves invalid reasoning steps, not synthesis validity failures; it is a faithfulness failure, not an FM-4 failure. Matrix coverage is unchanged.

Cross-Cutting Observations

Credentialing Drift Pattern

Three independent findings confirmed the same structural pattern: PRISMA compliance, pre-registration badges, and TRIZ certification degrade via Goodhart’s Law into credentialing rituals decoupled from underlying quality. “Credentialing Drift” — the temporal erosion of signal value of any credential as the field learns to produce the credential without the underlying behavior — is a novel framework contribution, grounded in Goodhart/Campbell literature. Tagged T4 — provisional in the Sources and Attribution appendix pending human review and more systematic prior art search.

Synthesis Amplification Tendency

Across multiple entries, synthesis documents consistently overstated the strength, universality, or causal certainty of findings. This is itself a demonstration of FM-2 operating at scale via sycophantic retrieval. Tagged T3 in the Sources and Attribution appendix — framework authors’ own systematic observation.

Blog-as-Primary-Source Risk

Entry 6 demonstrates the failure mode: commercial blogs reframing primary findings using novel terminology create attribution chains where the terminology gets cited forward. Intervention: always trace citations to primary sources before using terminology introduced in secondary sources.
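
The tracing step can be sketched as a walk along the attribution chain, ordered from the primary source outward. This is a minimal illustration, not part of the framework: the `Hop` class and function names are assumptions. The press release flag is inferred from the Entry 6 finding that the blog introduced the term.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    name: str         # who published this link in the chain
    is_primary: bool  # True only for the original paper
    uses_term: bool   # whether this hop uses the term being traced


def first_use(chain):
    """Return the earliest hop that uses the term, or None if no hop does."""
    for hop in chain:
        if hop.uses_term:
            return hop
    return None


def term_is_attributable_to_primary(chain):
    """A term may be attributed to the paper only if the primary source itself uses it."""
    origin = first_use(chain)
    return origin is not None and origin.is_primary


# Entry 6 chain for the term "reasoning-action disconnect".
ENTRY_6_CHAIN = [
    Hop("Soffer et al. 2025 (npj Digital Medicine)", is_primary=True, uses_term=False),
    Hop("Mount Sinai press release", is_primary=False, uses_term=False),  # inferred
    Hop("MindStudio commercial blog", is_primary=False, uses_term=True),
    Hop("synthesis document", is_primary=False, uses_term=True),
]
```

For this chain, `first_use(ENTRY_6_CHAIN)` returns the MindStudio hop and `term_is_attributable_to_primary(ENTRY_6_CHAIN)` is false, which matches the Entry 6 finding.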

QRP→AI Analog Mapping (updated 2026-06-03)

Three analogs confirmed and sourced:

Sycophancy ≈ HARKing: Batista & Griffiths (2026) — T2 synthesis inference, Bayesian mechanism formally established
Prompt-hacking ≈ p-hacking: Kosch & Feger (2026) — T3 opinion piece, cite as argument not empirical finding
PARKing ≈ HARKing (researcher-behavior level): Kosch & Feger (2026)

Novel agentic failure modes (memory poisoning, adversarial prompt injection, cross-agent trust escalation) confirmed to have no QRP analog — this asymmetry is real and should be named explicitly in the framework.

Lancet/Columbia 12x Figure

Not formally validated via direct paper access. Must carry three caveats: causal conflation (AI vs. paper mills), Open Access subset bias, AI verifier precision limitations (91% on 500-record validation set). Trend direction is real; specific figure should not be cited without qualifications.

Outstanding Validation Items (priority order)

van den Akker attribution propagation — correct “Bogaert et al.” in all corpus documents
PMC13105447 — algorithmic sycophancy in biomedical research; direct access not yet performed
Credentialing Drift prior art search — confirm the named phenomenon has no prior literature before the framework claims it as a novel contribution
ICLR 20% figure removal — composite statistic must be replaced with two accurately cited separate claims in all corpus documents

Temporal Validity Observation

(Added 2026-06-02, updated 2026-06-03)

AI capability research has a short shelf life. Model generations tested across the corpus:

Sharma et al. (ICLR 2024): Claude 1.3, Claude 2, GPT-3.5, GPT-4, LLaMA 2
Fernandes et al.: ChatGPT-4o (gpt-4o-2024-05-13)
Soffer et al.: ChatGPT-o1, o3, Gemini-2.5
Shapira et al.: open-source reward models only
Turpin et al. (NeurIPS 2023): GPT-3.5, Claude 1.0
Arcuschin et al. (ICML 2026): 15 frontier models including Claude 3.7 Sonnet, GPT-4o, Gemini 2.5 Pro, DeepSeek R1 — most current and broadest coverage in corpus
Batista & Griffiths (2026): GPT-5.1-Chat
Kim et al. (2026): GPT-5.2, Gemini-2.5-Flash, Claude Sonnet 4.5 (Dec 2025–Jan 2026)

The failure mode taxonomy is more durable than specific rates. Where rates are cited, they should carry model generation and date of measurement. Rate figures are illustrative, not foundational.
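
One way to make this requirement hard to skip is to store each rate with its model and date attached. The record below is a sketch under stated assumptions: the class name, field names, and output format are invented, and the source posting date stands in for the date of measurement, which the papers may report differently.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RateCitation:
    source: str
    figure: str
    model: str
    as_of: date  # assumption: posting date used as a proxy for measurement date

    def __post_init__(self):
        # Refuse to build a rate citation with no model generation attached.
        if not self.model.strip():
            raise ValueError("rate citation needs the model generation it was measured on")

    def render(self) -> str:
        """Format the rate together with its model and date qualifiers."""
        return (
            f"{self.figure} ({self.source}; model: {self.model}; "
            f"as of {self.as_of.isoformat()})"
        )


# Figures as recorded in Entry 8 of this log.
example = RateCitation(
    source="Batista & Griffiths 2026, preprint",
    figure="29.5% vs. 5.9% discovery rate",
    model="GPT-5.1-Chat",
    as_of=date(2026, 2, 15),
)
rendered = example.render()
```

With this shape, a rate cannot be rendered without its model and date, so a bare figure never reaches a corpus document by accident.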
