References

Aczel, Balazs, Barnabas Szaszi, and Alex O. Holcombe, “A billion-dollar donation: Estimating the cost of researchers’ time spent on peer review,” Research Integrity and Peer Review, 6 (2021), 1–8 (Springer).

Asher, Samuel G. Z., Janet Malzahn, Jessica M. Persano, Elliot J. Paschal, Andrew C. W. Myers, and Andrew B. Hall, “Do Claude Code and Codex p-hack? Sycophancy and statistical analysis in large language models,” 2026.

Asirvatham, Hemanth, Elliott Mokski, and Andrei Shleifer, “GPT as a measurement tool,” NBER Working Paper, 2026 (National Bureau of Economic Research).

Choi, Byungjin, Tae Joon Jun, Joung Won Sung, Il Woo Park, Jeong-Moo Lee, Soo Ick Cho, Hyung Jun Park, Ro Woon Lee, and Jungyo Suh, “Invisible text injection and peer review by AI models,” JAMA Network Open, 9 (2026), e2552099.

Elsevier, “Generative AI policies for journals” (Feb. 19, 2026).

IsItCredible.com, “Is it credible?” (Feb. 19, 2026).

Lee, Ro Woon, Tae Joon Jun, Jeong-Moo Lee, Soo Ick Cho, Hyung Jun Park, and Jungyo Suh, “Vulnerability of large language models to prompt injection when providing medical advice,” JAMA Network Open, 8 (2025), e2549963.

Leung, Tiffany I., “LLMs in peer review—how publishing policies must advance,” JAMA Network Open, 9 (2026), e2552042.

Pataranutaporn, Pat, Nattavudh Powdthavee, Chayapatr Achiwaranguprok, and Pattie Maes, “Can AI solve the peer review crisis? A large scale cross model experiment of LLMs’ performance and biases in evaluating over 1000 economics papers,” 2025.

Rajakumar, Hamrish Kumar, Kailash Abhishek Sankaran, Manasi Pillai Ashok, and Srinivas Rachoori, “Peer review in the age of artificial intelligence: A comparative study of human and AI-generated review reports,” Postgraduate Medical Journal (2026), qgag005.

Refine, “FAQ - Refine” (Feb. 19, 2026).

Spitzer, Markus Wolfgang Hermann, “The emerging submission crisis in behavioral science,” Trends in Neuroscience and Education, 42 (2026), 100276.

Thomas, Llewellyn D. W., Angelo Kenneth G. Romasanta, and Laia Pujol Priego, “Jagged competencies: Measuring the reliability of generative AI in academic research,” Journal of Business Research, 203 (2026), 115804.

Wang, Yuehan, Jinyan Huang, Lun Du, Yuxin Guo, Ying Liu, and Rong Wang, “Evaluating large language models as raters in large-scale writing assessments: A psychometric framework for reliability and validity,” Computers and Education: Artificial Intelligence, 9 (2025), 100481.

Xu, Yiqing, and Leo Y. Yang, “Scaling reproducibility: An AI-assisted workflow for large-scale reanalysis,” 2026.

Zhang, Tianmai M., and Neil F. Abernethy, “Reviewing scientific papers for critical problems with reasoning LLMs: Baseline approaches and automatic evaluation,” arXiv preprint arXiv:2505.23824 (2025).