
Just Ask the Model: One-Shot LLM Research Evaluation and Structured Expert Review

Authors and Affiliations

Valentin Klotzbücher (University of Basel & University Hospital Basel)

David Reinstein (The Unjournal)

Lorenzo Pacchiardi (University of Cambridge, Leverhulme Centre for the Future of Intelligence)

Tianmai Michael Zhang (University of Washington)

Published

February 26, 2026

Abstract

Peer review is strained, and AI tools generating referee-like feedback are already adopted by researchers and commercial services—yet field evidence on how reliably frontier LLMs can evaluate research remains scarce. We compare structured evaluations produced by six frontier LLMs against paid expert review packages from The Unjournal, an open evaluation platform covering economics and social-science working papers, where both humans and models rate papers on seven percentile criteria with uncertainty intervals and provide narrative critiques. Treating human evaluations as a high-quality but noisy reference signal, we find that top reasoning-capable models often approach the ceiling implied by human inter-rater variability on several criteria, while exhibiting consistent failure modes: compressed rating scales, uneven criterion coverage, and variable identification of expert-flagged concerns. Our results support AI as structured screening and decision support rather than full automation, and motivate stability checks and stronger safeguards against adversarial manipulation.

Introduction

# Include global setup and parameters
source("setup_params.R")

Caution: Working paper

This is an evolving working paper. Analysis, metrics, and comparisons are under active development.

This work is a collaboration with The Unjournal, which has published more than 55 detailed evaluation packages as part of building open, quantitative evaluation infrastructure for global-priorities-relevant research in economics, policy, and social science. We thank The Unjournal’s evaluation community for generating the structured human assessments that make this comparison possible.

Funding for The Unjournal has been provided by the Survival and Flourishing Fund, the Long Term Future Fund, and EA Funds.

Peer review is under strain. Reviewers are hard to find, turnaround times are lengthening, and the system costs an estimated $1.5 billion per year in the United States alone (Aczel, Szaszi, and Holcombe 2021). At the same time, generative AI lowers the cost of producing polished manuscripts; in at least some fields, editors report submission growth that exceeds reviewer capacity, and explicitly link this trend to LLM-assisted writing (Spitzer 2026). This combination creates demand for automated support in editorial and pre-submission workflows.

AI “reviewer” products are already marketed directly to authors. For example, Refine offers automated, comment-style feedback on drafts and emphasizes that it is not a comprehensive replacement for content development or fact checking (Refine n.d.). IsItCredible (“Reviewer 2”) similarly sells automated referee-like reports, explicitly states that the service is not a substitute for human peer review, and claims uploaded files are deleted after delivery (IsItCredible.com n.d.). These commercial tools likely use bespoke multi-step or agentic pipelines, though their internal architectures are not publicly documented. Meanwhile, publishers are formalizing policies that treat manuscripts and reviews as confidential and prohibit reviewers from uploading them into general-purpose generative AI tools, while permitting limited, controlled uses of AI for screening tasks (e.g., completeness and plagiarism checks) within publisher-managed systems (Elsevier 2025; Leung 2026).

These developments make the evidentiary gap salient: funders, editors, and policymakers need to know when AI evaluation outputs are trustworthy enough to use, and when they are unstable, biased, or manipulable. Recent work highlights all three concerns. First, reproducibility can be “jagged”: repeated runs of the same models on the same corpus over time can be highly consistent for some tasks and models, but much less so for others (Thomas, Romasanta, and Pujol Priego 2026); robustness may require separating scientific judgment from computational execution (Xu and Yang 2026); and even without overt adversarial intent, subtle reframings of the same task can induce systematic shifts in outputs—a form of LLM “specification search”—raising concerns about frame-sensitive biases when models serve as measurement instruments (Asher et al. 2026). Second, adversarial manipulation is not hypothetical: invisible-text “prompt injection” can substantially inflate LLM-assigned review scores and acceptance recommendations in simulated peer review (Choi et al. 2026), and prompt-injection vulnerabilities are also documented in other high-stakes advice settings (Lee et al. 2025). Third, even when outputs look fluent and plausible, it remains unclear whether AI models approximate expert judgment: AI-generated reviews tend to cover more surface-level sections while being less thematically diverse and less focused on interpretation, originality, and applicability than human reviews (Rajakumar et al. 2026); LLMs used as manuscript quality checkers identify only a small fraction of confirmed critical errors even with the strongest reasoning models (Zhang and Abernethy 2025); and LLM scoring exhibits systematic range restriction and halo effects that can distort agreement metrics (Wang et al. 2025).

The central question we address is therefore: how reliably can frontier LLMs evaluate research, relative to expert peer review and under realistic levels of rater disagreement? We study this question in a setting designed to make “expert judgment” observable and multi-dimensional rather than implicit.

We use The Unjournal’s structured human evaluations as a reference signal. We prompt six frontier LLMs—GPT-5 Pro, GPT-5.2 Pro, GPT-4o-mini, Claude Sonnet 4, Claude Opus 4.6¹, and Gemini 2.0 Flash—with the same rubric and guidelines used by human evaluators, then compare the resulting quantitative ratings and qualitative critiques against expert evaluations for 60 economics and social-science working papers. We ask whether frontier AI evaluations can approximate expert judgment, which models perform best across rating criteria and cost tiers, and whether systematic differences reveal characteristic AI preferences over research. Our headline finding is that the best-performing model (GPT-5 Pro) matches or exceeds pairwise human inter-rater rank agreement on overall quality, making a strong case that frontier LLMs can serve as additional expert raters in structured evaluation pipelines, even under our deliberately minimal one-shot setup.
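The headline metric, pairwise inter-rater rank agreement, can be sketched in a few lines. The following Python illustration is not the authors' pipeline: it simply computes Spearman rank correlations between raters' "overall quality" percentile ratings over a shared set of papers, treating the LLM as one more rater. All rater labels and scores are invented for demonstration.

```python
# Illustrative sketch (not the paper's actual code): pairwise inter-rater
# rank agreement as Spearman correlation between raters' percentile
# ratings over the same papers. All names and numbers are invented.

def ranks(xs):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    out = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5

# Hypothetical "overall quality" percentiles for five papers.
ratings = {
    "human_1": [62, 75, 40, 88, 55],
    "human_2": [58, 80, 35, 90, 60],
    "llm":     [65, 70, 45, 85, 50],
}

# Human-human agreement sets the noise ceiling; the LLM is scored
# against each human and averaged.
human_benchmark = spearman(ratings["human_1"], ratings["human_2"])
llm_vs_humans = (spearman(ratings["llm"], ratings["human_1"])
                 + spearman(ratings["llm"], ratings["human_2"])) / 2
print(f"human-human agreement: {human_benchmark:.2f}")  # 0.90
print(f"LLM-human agreement:   {llm_vs_humans:.2f}")    # 0.95
```

In this toy example the LLM's mean agreement with humans exceeds the human-human benchmark, mirroring the kind of comparison the paper reports for GPT-5 Pro on overall quality.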

Our approach is deliberately minimal: each model receives the same PDF and a fixed rubric in a single prompt, with no iteration, retrieval augmentation, chain-of-thought scaffolding, or multi-step agentic loop. This makes our results a conservative lower bound on what LLM-based evaluation can achieve. If frontier models already yield meaningful agreement with expert reviewers under the simplest possible setup, more sophisticated pipelines—structured measurement schemas (Asirvatham, Mokski, and Shleifer 2026), iterative quality-checking workflows (Zhang and Abernethy 2025), or the kind of prompt-robustness engineering motivated by specification-search concerns (Asher et al. 2026)—should improve further. Quantifying how much headroom remains above this one-shot baseline, and which pipeline elements unlock it, is a key direction for future work.

The Unjournal setting is particularly well suited for this comparison. It commissions paid expert evaluations using a structured rubric covering seven percentile criteria with 90% credible intervals plus journal-tier predictions, and publishes the resulting packages openly—reducing classic gatekeeping motives and increasing reviewer effort. The resulting ratings and critiques still exhibit substantial inter-rater variation; accordingly, we treat human evaluations as a high-quality but noisy reference signal, not ground truth. The rich, multi-dimensional data allow us to compare the priorities and calibration of humans and AI models across criteria and domains, while the journal-tier predictions provide an external reference point2 enabling a human-vs-LLM horse race. Finally, The Unjournal’s pipeline of future evaluations allows for clean out-of-training-data predictions, serving as a live testing lab for prospective validation.
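To make the rubric structure concrete, a single evaluation record of the kind described above might look as follows. This is a hypothetical sketch: the criterion names paraphrase the public rubric and every number is invented; each criterion carries a percentile midpoint plus a 90% credible interval, alongside a journal-tier prediction.

```python
# Hypothetical Unjournal-style evaluation record (illustrative only;
# criterion names are paraphrased and all values are invented).
evaluation = {
    "criteria": {
        "overall_assessment":   {"mid": 72, "lo": 55, "hi": 85},
        "advancing_knowledge":  {"mid": 68, "lo": 50, "hi": 80},
        "methods":              {"mid": 60, "lo": 40, "hi": 75},
        "logic_communication":  {"mid": 78, "lo": 60, "hi": 90},
        "open_science":         {"mid": 50, "lo": 30, "hi": 70},
        "global_relevance":     {"mid": 70, "lo": 50, "hi": 85},
        "real_world_relevance": {"mid": 65, "lo": 45, "hi": 80},
    },
    "journal_tier_prediction": {"mid": 3.5, "lo": 2.5, "hi": 4.5},
}

# Basic consistency check: every midpoint must lie inside its interval.
for name, c in evaluation["criteria"].items():
    assert c["lo"] <= c["mid"] <= c["hi"], name

# Mean interval width is a crude proxy for stated uncertainty, useful
# when comparing human and LLM calibration across criteria.
widths = [c["hi"] - c["lo"] for c in evaluation["criteria"].values()]
mean_width = sum(widths) / len(widths)
print(f"mean 90% interval width: {mean_width:.1f} percentile points")
```

Comparing such interval widths between human and model evaluations is one way to surface the compressed rating scales noted in the abstract.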


  1. Claude Opus 4.6 was evaluated without extended thinking enabled; results for this model reflect standard inference and likely understate its ceiling capability. A re-run with extended thinking is planned.

  2. These represent verifiable publication outcomes, not statements about the “true quality” of the paper.