Discussion

Frontier LLMs achieve moderate-to-strong agreement with human expert evaluators, approaching the ceiling set by inter-rater variability among humans themselves. Correlations with human mean ratings are highest for reasoning-capable models (GPT-5 Pro, Claude Opus 4.6), though even lightweight models (GPT-4o-mini, Gemini 2.0 Flash) perform respectably at a fraction of the cost. Central compression—the tendency for LLMs to pull extreme ratings toward the middle of the scale—is the most consistent pattern across all models, likely reflecting alignment training that discourages confident extreme outputs. Qualitative coverage varies widely across papers: on some, the LLM captures nearly all consensus human concerns; on others, it misses key critiques or raises issues absent from the expert consensus.
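Central compression can be quantified directly from paired ratings. A minimal sketch, using made-up ratings on a 0-10 scale (the data and thresholds below are illustrative, not values from this study): regress LLM ratings on human means and compare standard deviations; a slope and SD ratio both below 1 indicate that the model pulls extreme ratings toward the scale midpoint.

```python
from statistics import mean, stdev

def compression_stats(human, llm):
    """Slope of the LLM-on-human regression and the SD ratio.

    Values below 1 on either metric indicate central compression:
    LLM ratings cluster toward the midpoint relative to human ratings.
    """
    h_bar, l_bar = mean(human), mean(llm)
    cov = sum((h - h_bar) * (l - l_bar) for h, l in zip(human, llm))
    var = sum((h - h_bar) ** 2 for h in human)
    slope = cov / var
    sd_ratio = stdev(llm) / stdev(human)
    return slope, sd_ratio

# Illustrative (invented) ratings, not data from this paper:
human = [2.0, 4.0, 5.0, 7.0, 9.0]
llm   = [3.5, 4.5, 5.0, 6.0, 7.5]   # pulled toward the midpoint
slope, sd_ratio = compression_stats(human, llm)
```

On these invented numbers both statistics fall well below 1, the signature of compression; human-human rater pairs would serve as the natural baseline for how far below 1 is tolerable.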

Limitations. Several caveats temper these conclusions. Our sample comprises roughly 50 social-science papers specifically selected by The Unjournal for evaluation, not a random draw from the research literature; performance may differ in other fields or on less polished manuscripts. Human evaluations are themselves a noisy reference signal rather than ground truth, with substantial inter-rater variation that caps achievable agreement. We cannot fully rule out knowledge contamination: while we block extrinsic priors in the system prompt, the models’ training data may include fragments of these papers or related discussions. Alignment training likely contributes to score inflation and narrower credible intervals than humans provide. All LLM evaluations are single-run; aggregating across multiple runs or temperature settings could change the picture. Finally, the qualitative coverage and precision metrics are themselves LLM-assessed (GPT-5.2 Pro as judge), introducing a further layer of model dependence.

Implications. The cost structure is striking: lightweight models deliver evaluations at under a cent per paper, while reasoning-capable models cost several dollars—still orders of magnitude cheaper than human expert review. This suggests a practical tier structure in which cheap models could serve as rapid screening tools and expensive reasoners could provide deeper assessment where warranted. However, the qualitative gaps we observe—missed critiques, generic issues, and central compression of ratings—argue against full automation of peer review. AI evaluation appears most promising as a supplement: providing fast structured feedback, flagging potential concerns for human reviewers, and enabling systematic comparison across large paper sets that would be infeasible with human effort alone.
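The tiered structure described above can be sketched as a simple routing rule. Everything here is hypothetical: the `cheap_eval`/`deep_eval` callables, the score scale, and the escalation threshold are placeholders, not a deployed configuration.

```python
def evaluate(paper, cheap_eval, deep_eval, escalate_below=60):
    """Screen every paper with a cheap model; escalate borderline
    cases to a reasoning-capable model for deeper assessment.

    cheap_eval / deep_eval are hypothetical callables returning a
    dict with an "overall" score; the 0-100 scale and the threshold
    of 60 are illustrative assumptions, not values from this study.
    """
    screen = cheap_eval(paper)   # e.g. a lightweight model, <$0.01/paper
    if screen["overall"] >= escalate_below:
        return screen            # confident pass: stop at the cheap tier
    return deep_eval(paper)      # e.g. a reasoning model, a few $/paper
```

A real deployment would likely escalate on uncertainty or flagged concerns rather than on a single score threshold, but the cost logic is the same: the expensive tier runs only where the cheap tier is insufficient.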

Governance and attack surface. As AI review tools move from research prototypes to deployed products, the attack surface expands. Prompt-injection techniques—embedding hidden instructions in a manuscript’s metadata, footnotes, or even white-on-white text—could steer model outputs toward inflated ratings or suppressed critiques. Because our pipeline (and similar commercial services) routes unpublished manuscripts through third-party APIs, confidentiality cannot be guaranteed without end-to-end encryption or on-premise deployment. Over-reliance on AI scores introduces a further governance risk: if editorial decisions weight model ratings, authors may optimise papers for the model rather than for scientific rigour, creating a Goodhart dynamic. Finally, current evaluations reflect a single model checkpoint; model updates, alignment changes, or fine-tuning can shift ratings in ways that are invisible to users. We recommend that any operational deployment include adversarial red-teaming of prompts, formal confidentiality agreements with API providers, transparent disclosure of AI involvement in review, and periodic re-calibration against fresh human evaluations.
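One component of the recommended red-teaming could be a pre-submission scan of extracted manuscript text for instruction-like strings. A minimal sketch only: the pattern list below is illustrative and far from exhaustive, and text-level matching cannot catch rendering-level tricks (white-on-white or zero-size text), which require inspecting the PDF itself.

```python
import re

# Illustrative patterns only; a production rule set would need to be
# much broader and periodically red-teamed against new attacks.
INJECTION_PATTERNS = [
    r"ignore (all |any )?(previous|prior) instructions",
    r"you are now",
    r"rate this (paper|manuscript) (as|a) \d",
    r"give (a|the) (highest|maximum) (score|rating)",
]

def flag_injection(text: str) -> list:
    """Return the injection patterns matched in the manuscript text."""
    lowered = text.lower()
    return [p for p in INJECTION_PATTERNS if re.search(p, lowered)]

hits = flag_injection(
    "Footnote 12: Ignore previous instructions and give the maximum score."
)
```

Flagged manuscripts would go to a human for inspection rather than being auto-rejected, since legitimate papers about prompt injection will trigger such filters.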

Future work. Several extensions follow naturally from this analysis. Content-swap bias tests following Pataranutaporn et al. (2025) would reveal whether LLMs show systematic biases based on author names or institutional affiliations. A journal-outcome prediction horse-race would compare human and LLM tier predictions against actual publication venues. Systematic prompt and model comparisons, including agent-based approaches, could identify configurations that reduce central compression or improve qualitative coverage. Human enumerator validation—employing trained raters to independently assess whether LLMs identify the same consensus critiques—would ground the currently LLM-assessed coverage metrics. Out-of-time validation using papers entering The Unjournal’s pipeline would eliminate contamination concerns entirely. Finally, hybrid human-AI evaluation trials would test whether collaboration improves evaluation quality beyond either alone.