Include global setup and parameters
source("setup_params.R")
# get the choices printed

This chapter describes the data sources, evaluation pipeline, prompt design, and statistical methods used throughout. All Python code underlying the pipeline is preserved in the collapsible blocks below and can be executed to reproduce the evaluation runs.

Sample and human reference data. We draw on published evaluations from The Unjournal, an open-access platform that commissions expert reviews of policy-relevant research without requiring journal submission. Each paper is typically assessed by two independent evaluators (occasionally one or three), who provide a written critique resembling a referee report together with quantitative ratings on seven criteria scored as percentiles (0–100) relative to “all serious research in the same area encountered in the last three years.”1 Evaluators additionally predict the journal tier in which the work “should” and “will” publish, using a 0–5 continuous scale anchored to familiar venue categories (0 = unpublishable, 5 = top-5 journal). For each metric, evaluators report a midpoint (median of their belief distribution) and a 90% credible interval that expresses their epistemic uncertainty.

The sample comprises working papers spanning 2017–2025 in development economics, health policy, environmental economics, and related fields that The Unjournal identified as high-impact. All papers have completed The Unjournal’s full evaluation process, meaning the evaluations have been finalised and publicly posted. For our analysis we extracted the individual evaluator scores, aggregated them by taking the arithmetic mean per paper per criterion, and retained the range of individual scores to characterise inter-rater spread.
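
To make the aggregation step concrete, the sketch below applies the same mean-and-range summary to a toy table of evaluator scores; the column names are illustrative rather than the exact field names of The Unjournal export.

Illustrative aggregation of evaluator scores (toy data)
import pandas as pd

# Illustrative evaluator-level scores (column names are hypothetical)
scores = pd.DataFrame({
    "paper":     ["A", "A", "B", "B"],
    "criterion": ["overall", "overall", "overall", "overall"],
    "evaluator": [1, 2, 1, 2],
    "midpoint":  [62.0, 70.0, 45.0, 58.0],
})

# Aggregate: arithmetic mean per paper per criterion, retaining the range of
# individual scores to characterise inter-rater spread.
agg = (scores
       .groupby(["paper", "criterion"])["midpoint"]
       .agg(human_mean="mean", human_min="min", human_max="max")
       .assign(human_range=lambda d: d["human_max"] - d["human_min"])
       .reset_index())
print(agg)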

LLM evaluation pipeline. We evaluate six frontier models spanning three providers: GPT-5 Pro and GPT-5.2 Pro (OpenAI, reasoning-capable), GPT-4o-mini (OpenAI, lightweight), Claude Sonnet 4 and Claude Opus 4.6 (Anthropic), and Gemini 2.0 Flash (Google). Each model receives the identical PDF file, system prompt, and output schema. We pass the PDF directly to the model’s native multimodal input rather than extracting text, preserving tables, figures, equations, and layout cues that ad-hoc scraping could mangle. A single API call per paper avoids hand-offs and summary loss from multi-stage pipelines.

The output is constrained by a strict JSON Schema enforcing the same nine fields that human evaluators complete: seven percentile metrics (each with midpoint, lower_bound, upper_bound) and two journal-tier predictions (each with score, ci_lower, ci_upper). Additionally, the model produces an assessment_summary of approximately 1,000 words that must precede scoring—a “think first, score second” protocol designed to ground numeric ratings in specific textual evidence. The extended schema used for our GPT-5.2 Pro focal run adds a key_issues array of concise, ranked issue statements for downstream critique comparison.
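
For orientation, a schema-conforming response has the abridged shape shown below; all numeric values are illustrative, not taken from an actual run.

Abridged example of a schema-conforming output (illustrative)
# Abridged, schema-conforming output (values are illustrative only)
example_output = {
    "assessment_summary": "Roughly 1,000-word diagnostic summary precedes the scores ...",
    "metrics": {
        "overall":      {"midpoint": 68, "lower_bound": 55, "upper_bound": 80},
        "methods":      {"midpoint": 60, "lower_bound": 45, "upper_bound": 75},
        # ... the remaining five percentile metrics take the same form ...
        "tier_should":  {"score": 3.4, "ci_lower": 2.8, "ci_upper": 4.0},
        "tier_will":    {"score": 3.0, "ci_lower": 2.2, "ci_upper": 3.8},
    },
}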

Agreement and reliability statistics. We quantify human–LLM agreement using several complementary measures. Pearson’s r captures linear association between paired ratings but is invariant to constant offsets and to rescaling, so it cannot by itself reveal systematic bias. Spearman’s ρ measures rank-order agreement and is robust to outliers and non-linear monotone relationships. Mean bias (LLM minus human) indicates the direction and magnitude of any systematic offset. Root mean squared error (RMSE) and mean absolute error (MAE) measure typical prediction error in the original scale units; RMSE penalises large deviations more heavily. To contextualise human–LLM agreement we report Krippendorff’s alpha (\(\alpha\)), a chance-corrected reliability coefficient that generalises across varying numbers of raters, accommodates missing data, and applies to any measurement level (nominal, ordinal, interval, or ratio). An \(\alpha\) of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and values below 0 indicate systematic disagreement. We compute \(\alpha_{\text{HH}}\) (among human evaluators only) as a ceiling: if human evaluators agree with each other at only \(\alpha = 0.5\) on a given criterion, expecting an LLM to exceed that level would be unrealistic. We then report \(\alpha_{\text{HL}}\) (between the human mean and each LLM) to assess how close machine ratings come to this ceiling. For the qualitative key-issue comparison, coverage denotes the fraction of human-identified issues that received a match score \(\geq 30\%\) from the LLM judge, and precision denotes the fraction of LLM-generated issues that matched at least one human issue.
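
The sketch below shows how these agreement statistics can be computed for one criterion from paired human-mean and LLM ratings. The values are illustrative, and the Krippendorff’s alpha call assumes the third-party krippendorff package is installed.

Agreement statistics (illustrative sketch)
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Paired ratings for one criterion (illustrative values)
human = np.array([55.0, 62.5, 70.0, 40.0, 81.0])   # mean of human evaluators per paper
llm   = np.array([60.0, 58.0, 75.0, 52.0, 78.0])   # LLM midpoint per paper

r, _   = pearsonr(human, llm)             # linear association
rho, _ = spearmanr(human, llm)            # rank-order agreement
bias   = np.mean(llm - human)             # mean bias, LLM minus human
rmse   = np.sqrt(np.mean((llm - human) ** 2))
mae    = np.mean(np.abs(llm - human))

# Krippendorff's alpha (interval level), treating the human mean and the LLM
# as two "raters". Assumes the third-party `krippendorff` package:
# import krippendorff
# alpha_hl = krippendorff.alpha(reliability_data=np.vstack([human, llm]),
#                               level_of_measurement="interval")

# Key-issue comparison summaries: match_scores[i] is the LLM judge's match
# score (0-100) for human issue i; llm_matched[j] says whether LLM issue j
# matched at least one human issue.
match_scores = [80, 45, 10, 30]
llm_matched  = [True, True, False, True, False]
coverage  = np.mean([s >= 30 for s in match_scores])   # human issues captured
precision = np.mean(llm_matched)                        # LLM issues that matched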

LLM evaluation pipeline setup
import os, time, json, random, hashlib
import pathlib
from typing import Any, Dict, Optional, Union

import pandas as pd
import numpy as np

import openai
from openai import OpenAI

# ---------- Configuration (in-file, no external deps)
API_KEY_PATH = pathlib.Path(os.getenv("OPENAI_KEY_PATH", "key/openai_key.txt"))
MODEL        = os.getenv("OPENAI_MODEL", "gpt-5-pro-2025-10-06")
FILE_PURPOSE = "assistants"  # for Responses API file inputs

# Run ID for organizing outputs - change this for each new evaluation run
# Set via environment variable or modify directly here
RUN_ID       = os.getenv("UJ_RUN_ID", "gpt5_pro_updated_jan2026")
RESULTS_DIR  = pathlib.Path("results")
RESULTS_DIR.mkdir(exist_ok=True)
RUN_DIR      = RESULTS_DIR / RUN_ID
RUN_DIR.mkdir(exist_ok=True)
FILE_CACHE   = RESULTS_DIR / ".file_cache.json"  # shared across runs

# ---------- API key bootstrap
if os.getenv("OPENAI_API_KEY") is None and API_KEY_PATH.exists():
    os.environ["OPENAI_API_KEY"] = API_KEY_PATH.read_text().strip()
if not os.getenv("OPENAI_API_KEY"):
    raise ValueError("No API key. Set OPENAI_API_KEY or create key/openai_key.txt")

client = OpenAI()

# ---------- Small utilities (inlined replacements for llm_utils)

def _resp_as_dict(resp: Any) -> Dict[str, Any]:
    if isinstance(resp, dict):
        return resp
    for attr in ("to_dict", "model_dump", "dict", "json"):
        if hasattr(resp, attr):
            try:
                val = getattr(resp, attr)()
                if isinstance(val, (str, bytes)):
                    try:
                        return json.loads(val)
                    except Exception:
                        pass
                if isinstance(val, dict):
                    return val
            except Exception:
                pass
    # last resort
    try:
        return json.loads(str(resp))
    except Exception:
        return {"_raw": str(resp)}

def _get_output_text(resp: Any) -> str:
    d = _resp_as_dict(resp)
    if "output_text" in d and isinstance(d["output_text"], str):
        return d["output_text"]
    out = d.get("output") or []
    chunks = []
    for item in out:
        if not isinstance(item, dict): continue
        if item.get("type") == "message":
            for c in item.get("content") or []:
                if isinstance(c, dict):
                    if "text" in c and isinstance(c["text"], str):
                        chunks.append(c["text"])
                    elif "output_text" in c and isinstance(c["output_text"], str):
                        chunks.append(c["output_text"])
    # Also check legacy top-level choices-like structures
    if not chunks:
        for k in ("content", "message"):
            v = d.get(k)
            if isinstance(v, str):
                chunks.append(v)
    return "\n".join(chunks).strip()

def _extract_json(s: str) -> Dict[str, Any]:
    """Robustly extract first top-level JSON object from a string."""
    if not s:
        raise ValueError("empty output text")
    # Fast path
    s_stripped = s.strip()
    if s_stripped.startswith("{") and s_stripped.endswith("}"):
        return json.loads(s_stripped)

    # Find first balanced {...} while respecting strings
    start = s.find("{")
    if start == -1:
        raise ValueError("no JSON object start found")
    depth = 0
    in_str = False
    esc = False
    for i in range(start, len(s)):
        ch = s[i]
        if in_str:
            if esc:
                esc = False
            elif ch == "\\":
                esc = True
            elif ch == '"':
                in_str = False
        else:
            if ch == '"':
                in_str = True
            elif ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    candidate = s[start:i+1]
                    return json.loads(candidate)
    raise ValueError("no balanced JSON object found")

def call_with_retries(fn, max_tries: int = 6, base_delay: float = 0.8, max_delay: float = 8.0):
    ex = None
    for attempt in range(1, max_tries + 1):
        try:
            return fn()
        except Exception as e:  # broad catch: includes openai.RateLimitError, APIError, APIConnectionError, APITimeoutError
            ex = e
            # Anthropic rate-limit errors need much longer backoff (large PDFs burn ~100k tokens)
            is_rate_limit = "rate_limit" in str(type(e)).lower() or "rate" in str(e).lower()
            if is_rate_limit:
                sleep = min(300, 60 * (1.5 ** (attempt - 1)))
            else:
                sleep = min(max_delay, base_delay * (1.8 ** (attempt - 1)))
            sleep *= (1 + 0.25 * random.random())
            print(f"  Retry {attempt}/{max_tries} after {sleep:.0f}s: {e}")
            time.sleep(sleep)
    raise ex

def _load_cache() -> Dict[str, Any]:
    if FILE_CACHE.exists():
        try:
            return json.loads(FILE_CACHE.read_text())
        except Exception:
            return {}
    return {}

def _save_cache(cache: Dict[str, Any]) -> None:
    FILE_CACHE.write_text(json.dumps(cache, ensure_ascii=False, indent=2))

def _file_sig(p: pathlib.Path) -> Dict[str, Any]:
    st = p.stat()
    return {"size": st.st_size, "mtime": int(st.st_mtime)}

def get_file_id(path: Union[str, pathlib.Path], client: OpenAI) -> str:
    p = pathlib.Path(path)
    if not p.exists():
        raise FileNotFoundError(p)
    cache = _load_cache()
    key = str(p.resolve())
    sig = _file_sig(p)
    meta = cache.get(key)
    if meta and meta.get("size") == sig["size"] and meta.get("mtime") == sig["mtime"] and meta.get("file_id"):
        return meta["file_id"]
    # Upload fresh
    with open(p, "rb") as fh:
        f = call_with_retries(lambda: client.files.create(file=fh, purpose=FILE_PURPOSE))
    fd = _resp_as_dict(f)
    fid = fd.get("id")
    if not fid:
        raise RuntimeError(f"Upload did not return file id: {fd}")
    cache[key] = {"file_id": fid, **sig}
    _save_cache(cache)
    return fid

def _reasoning_meta(resp) -> Dict[str, Any]:
    d = _resp_as_dict(resp)
    rid, summary_text = None, None
    out = d.get("output") or []
    if out and isinstance(out, list) and out[0].get("type") == "reasoning":
        rid = out[0].get("id")
        summ = out[0].get("summary") or []
        if summ and isinstance(summ, list):
            summary_text = summ[0].get("text")
    usage = d.get("usage") or {}
    odet  = usage.get("output_tokens_details") or {}
    return {
        "response_id": d.get("id"),
        "reasoning_id": rid,
        "reasoning_summary": summary_text,
        "input_tokens": usage.get("input_tokens"),
        "output_tokens": usage.get("output_tokens"),
        "reasoning_tokens": odet.get("reasoning_tokens"),
    }
    

def read_csv_or_empty(path, columns=None, **kwargs):
    p = pathlib.Path(path)
    if not p.exists():
        return pd.DataFrame(columns=columns or [])
    try:
        df = pd.read_csv(p, **kwargs)
        if df is None or getattr(df, "shape", (0,0))[1] == 0:
            return pd.DataFrame(columns=columns or [])
        return df
    except (pd.errors.EmptyDataError, pd.errors.ParserError, OSError, ValueError):
        return pd.DataFrame(columns=columns or [])    

JSON Schema. The output schema enforces that every paper is scored on identical fields with identical types and bounds. Credible intervals are required (paralleling the human protocol) so that the model can express genuine uncertainty rather than suggest false precision.

JSON Schema definition
METRICS = [
    "overall",
    "claims_evidence",
    "methods",
    "advancing_knowledge",
    "logic_communication",
    "open_science",
    "global_relevance",
]

metric_schema = {
    "type": "object",
    "properties": {
        "midpoint":    {"type": "number", "minimum": 0, "maximum": 100},
        "lower_bound": {"type": "number", "minimum": 0, "maximum": 100},
        "upper_bound": {"type": "number", "minimum": 0, "maximum": 100},
    },
    "required": ["midpoint", "lower_bound", "upper_bound"],
    "additionalProperties": False,
}

TIER_METRIC_SCHEMA = {
    "type": "object",
    "properties": {
        "score":   {"type": "number", "minimum": 0, "maximum": 5},
        "ci_lower":{"type": "number", "minimum": 0, "maximum": 5},
        "ci_upper":{"type": "number", "minimum": 0, "maximum": 5},
    },
    "required": ["score", "ci_lower", "ci_upper"],
    "additionalProperties": False,
}

COMBINED_SCHEMA = {
    "type": "object",
    "properties": {
        "assessment_summary": {"type": "string"},
        "metrics": {
            "type": "object",
            "properties": {
                **{m: metric_schema for m in METRICS},
                "tier_should": TIER_METRIC_SCHEMA,
                "tier_will":   TIER_METRIC_SCHEMA,
            },
            "required": METRICS + ["tier_should", "tier_will"],
            "additionalProperties": False,
        },
    },
    "required": ["assessment_summary", "metrics"],
    "additionalProperties": False,
}

TEXT_FORMAT_COMBINED = {
    "type": "json_schema",
    "name": "paper_assessment_with_tiers_v2",
    "strict": True,
    "schema": COMBINED_SCHEMA,
}

System prompt design. The system prompt is assembled from modular components and concatenated before each API call. It opens with a role definition instructing the model to act as an expert research evaluator, followed by a debiasing block that explicitly prohibits use of author identity, institutional prestige, publication venue, or any extrinsic information—the model must base all judgments on the PDF content alone. A diagnostic-summary instruction requires the model to produce a roughly 1,000-word assessment identifying methodological, evidential, and interpretive issues before any scoring, implementing a “think first, score second” protocol intended to anchor numeric ratings in specific textual evidence.

Role, debiasing, and diagnostic instructions
PROMPT_ROLE = """
Your role -- You are an academic expert as well as a practitioner across every relevant field -- use all your knowledge and insight. You are acting as an expert research evaluator/reviewer.
"""

PROMPT_DEBIASING = """
Do not look at any existing ratings or evaluations of these papers you might find on the internet or in your corpus, do not use the authors' names, status, or institutions in your judgment; ignore where (or whether) the work is published, the prestige of any venue, and how much attention it has received. Do not use this as evidence about quality. You must base all judgments entirely on the content of the PDF.
"""

PROMPT_DIAGNOSTIC = """
Diagnostic summary (Aim for about 1000 words, based only on the PDF):
Provide a compact paragraph that identifies the most important issues you detect in the manuscript itself (e.g., identification threats, data limitations, misinterpretations, internal inconsistencies, missing robustness, replication barriers). Be specific, neutral, and concrete. This summary should precede any scoring and should guide your uncertainty. Output this text in the JSON field `assessment_summary`.
"""

Percentile ratings are anchored to the reference group “all serious research in the same area encountered in the last three years,” following The Unjournal’s guidelines for evaluators. The prompt defines each of the seven criteria with emphasis on global priorities and practical relevance over pure academic novelty, mirroring the weight structure that human Unjournal evaluators are asked to apply.

Percentile scale and reference group
PROMPT_PERCENTILE_INTRO = """
We ask for a set of quantitative metrics, based on your insights. For each metric, we ask for a score and a 90% credible interval. We describe these in detail below.

Percentile rankings relative to a reference group: For some questions, we ask for a percentile ranking from 0-100%. This represents "what proportion of papers in the reference group are worse than this paper, by this criterion". A score of 100% means this is essentially the best paper in the reference group. 0% is the worst paper. A score of 50% means this is the median paper; i.e., half of all papers in the reference group do this better, and half do this worse, and so on. Here the population of papers should be all serious research in the same area that you have encountered in the last three years. *Unless this work is in our 'applied and policy stream', in which case the reference group should be "all applied and policy research you have read that is aiming at a similar audience, and that has similar goals".
"""

PROMPT_REFERENCE_GROUP = """
"Serious" research? Academic research?
Here, we are mainly considering research done by professional researchers with high levels of training, experience, and familiarity with recent practice, who have time and resources to devote months or years to each such research project or paper.
These will typically be written as 'working papers' and presented at academic seminars before being submitted to standard academic journals. Although no credential is required, this typically includes people with PhD degrees (or upper-level PhD students). Most of this sort of research is done by full-time academics (professors, post-docs, academic staff, etc.) with a substantial research remit, as well as research staff at think tanks and research institutions (but there may be important exceptions).

What counts as the "same area"?
This is a judgment call. Some criteria to consider... First, does the work come from the same academic field and research subfield, and does it address questions that might be addressed using similar methods? Second, does it deal with the same substantive research question, or a closely related one? If the research you are evaluating is in a very niche topic, the comparison reference group should be expanded to consider work in other areas.

"Research that you have encountered"
We are aiming for comparability across evaluators. If you suspect you are particularly exposed to higher-quality work in this category, compared to other likely evaluators, you may want to adjust your reference group downwards. (And of course vice-versa, if you suspect you are particularly exposed to lower-quality work.)
"""

Metric definitions
PROMPT_METRICS = """
Midpoint rating and credible intervals: For each metric, we ask you to provide a 'midpoint rating' and a 90% credible interval as a measure of your uncertainty.

- "overall" - Overall assessment - Percentile ranking (0-100%): Judge the quality of the research heuristically. Consider all aspects of quality, credibility, importance to future impactful applied research, and practical relevance and usefulness, importance to knowledge production, and importance to practice.

- "claims_evidence" - Claims, strength and characterization of evidence (0-100%): Do the authors do a good job of (i) stating their main questions and claims, (ii) providing strong evidence and powerful approaches to inform these, and (iii) correctly characterizing the nature of their evidence?

- "methods" - Justification, reasonableness, validity, robustness (0-100%): Are the methods used well-justified and explained; are they a reasonable approach to answering the question(s) in this context? Are the underlying assumptions reasonable? Are the results and methods likely to be robust to reasonable changes in the underlying assumptions? Does the author demonstrate this? Did the authors take steps to reduce bias from opportunistic reporting and questionable research practices?

- "advancing_knowledge" - Advancing our knowledge and practice (0-100%): To what extent does the project contribute to the field or to practice, particularly in ways that are relevant to global priorities and impactful interventions? (Applied stream: please focus on 'improvements that are actually helpful'.) Less weight to "originality and cleverness": Originality and cleverness should be weighted less than the typical journal, because we focus on impact. Papers that apply existing techniques and frameworks more rigorously than previous work or apply them to new areas in ways that provide practical insights for GP (global priorities) and interventions should be highly valued. More weight should be placed on 'contribution to GP' than on 'contribution to the academic field'.
    Do the paper's insights inform our beliefs about important parameters and about the effectiveness of interventions?
    Does the project add useful value to other impactful research?
    We don't require surprising results; sound and well-presented null results can also be valuable.

- "logic_communication" - Logic and communication (0-100%): Are the goals and questions of the paper clearly expressed? Are concepts clearly defined and referenced? Is the reasoning "transparent"? Are assumptions made explicit? Are all logical steps clear and correct? Does the writing make the argument easy to follow? Are the conclusions consistent with the evidence (or formal proofs) presented? Do the authors accurately state the nature of their evidence, and the extent it supports their main claims? Are the data and/or analysis presented relevant to the arguments made? Are the tables, graphs, and diagrams easy to understand in the context of the narrative (e.g., no major errors in labeling)?

- "open_science" - Open, collaborative, replicable research (0-100%): This covers several considerations:
    - Replicability, reproducibility, data integrity: Would another researcher be able to perform the same analysis and get the same results? Are the methods explained clearly and in enough detail to enable easy and credible replication? For example, are all analyses and statistical tests explained, and is code provided? Is the source of the data clear? Is the data made as available as is reasonably possible? If so, is it clearly labeled and explained?
    - Consistency: Do the numbers in the paper and/or code output make sense? Are they internally consistent throughout the paper?
    - Useful building blocks: Do the authors provide tools, resources, data, and outputs that might enable or enhance future work and meta-analysis?

- "global_relevance" - Relevance to global priorities, usefulness for practitioners: Are the paper's chosen topic and approach likely to be useful to global priorities, cause prioritization, and high-impact interventions? Does the paper consider real-world relevance and deal with policy and implementation questions? Are the setup, assumptions, and focus realistic? Do the authors report results that are relevant to practitioners? Do they provide useful quantified estimates (costs, benefits, etc.) enabling practical impact quantification and prioritization? Do they communicate (at least in the abstract or introduction) in ways policymakers and decision-makers can understand, without misleading or oversimplifying?
"""

Subsequent prompt components instruct the model on constructing 90% credible intervals—the smallest interval the evaluator believes is 90% likely to contain the true value—encouraging calibrated uncertainty rather than artificially narrow bounds. The prompt requests journal-tier predictions on a 0–5 continuous scale anchored to familiar venue categories (0 = unpublishable through 5 = top-5 journal), providing an externally verifiable reference point for papers that eventually publish. A validation block then requires the model to verify internal consistency: numeric scores must align with the written assessment, credible intervals must be non-degenerate, and high or low ratings must be explicitly justified in the assessment summary.

Credible intervals, journal tiers, and validation
PROMPT_UNCERTAINTY = """
The midpoint and 'credible intervals': expressing uncertainty - What are we looking for and why?
- We want policymakers, researchers, funders, and managers to be able to use The Unjournal's evaluations to update their beliefs and make better decisions. To do this well, they need to weigh multiple evaluations against each other and other sources of information. Evaluators may feel confident about their rating for one category, but less confident in another area. How much weight should readers give to each? In this context, it is useful to quantify the uncertainty. But it's hard to quantify statements like "very certain" or "somewhat uncertain" – different people may use the same phrases to mean different things. That's why we're asking for a more precise measure: your credible intervals. These metrics are particularly useful for meta-science and meta-analysis. You are asked to give a 'midpoint' and a 90% credible interval. Consider this as the smallest interval that you believe is 90% likely to contain the true value.
- How do I come up with these intervals? (Discussion and guidance): You may understand the concepts of uncertainty and credible intervals, but you might be unfamiliar with applying them in a situation like this one. You may have a certain best guess for the "Methods..." criterion. Still, even an expert can never be certain. E.g., you may misunderstand some aspect of the paper, there may be a method you are not familiar with, etc. Your uncertainty over this could be described by some distribution, representing your beliefs about the true value of this criterion. Your "best guess" should be the central mass point of this distribution. For some questions, the "true value" refers to something objective, e.g. will this work be published in a top-ranked journal? In other cases, like the percentile rankings, the true value means "if you had complete evidence, knowledge, and wisdom, what value would you choose?" If you are well calibrated your 90% credible intervals should contain the true value 90% of the time. Consider the midpoint as the 'median of your belief distribution'.
- We also ask for the 'midpoint', the center dot on that slider. Essentially, we are asking for the median of your belief distribution. By this we mean the percentile ranking such that you believe "there's a 50% chance that the paper's true rank is higher than this, and a 50% chance that it actually ranks lower than this."
"""

PROMPT_TIERS = """
Additionally, we ask: What journal ranking tier should and will this work be published in?

To help universities and policymakers make sense of our evaluations, we want to benchmark them against how research is currently judged. So, we would like you to assess the paper in terms of journal rankings. We ask for two assessments:
1. a normative judgment about 'how well the research should publish';
2. a prediction about where the research will be published.
As before, we ask for a 90% credible interval.

Journal ranking tiers are on a 0-5 scale, as follows:
    0/5: "Won't publish/little to no value". Unlikely to be cited by credible researchers
    1/5: OK/Somewhat valuable journal
    2/5: Marginal B-journal/Decent field journal
    3/5: Top B-journal/Strong field journal
    4/5: Marginal A-Journal/Top field journal
    5/5: A-journal/Top journal

- We encourage you to consider a non-integer score, e.g. 4.6 or 2.2. If a paper would be most likely to be (or merits being) published in a journal that would rank about halfway between a top tier 'A journal' and a second tier (4/5) journal, you should rate it a 4.5. Similarly, if you think it has an 80% chance of (being/meriting) publication in a 'marginal B-journal' and a 20% chance of a Top B-journal, you should rate it 2.2. Please also use this continuous scale for providing credible intervals.

- Journal ranking tier "should" (0.0-5.0)
    Assess this paper on the journal ranking scale described above, considering only its merit, giving some weight to the category metrics we discussed above. Equivalently, where would this paper be published if:
    1. the journal process was fair, unbiased, and free of noise, and that status, social connections, and lobbying to get the paper published didn't matter;
    2. journals assessed research according to the category metrics we discussed above.

- Journal ranking tier "will" (0.0-5.0)
    What if this work has already been peer reviewed and published? If this work has already been published, and you know where, please report the prediction you would have given absent that knowledge.
"""

PROMPT_VALIDATION = """
When you set the quantitative metrics:
- Treat `midpoint` as your 50% belief (the value such that you think there is a 50% chance the true value is higher and 50% chance it is lower).
- Treat `lower_bound` and `upper_bound` as an honest 90% credible interval (roughly the 5th and 95th percentiles of your belief distribution).

For all percentile metrics (0–100 scale):
- You must always satisfy: lower_bound < midpoint < upper_bound.

For the journal tier metrics (0.0–5.0):
- You must always satisfy: ci_lower < score < ci_upper.

Before finalising your JSON:
- Check that your numeric scores are consistent with your own assessment_summary. If your summary describes serious or fundamental problems with methods, evidence, or interpretation, your scores for those metrics (and for "overall") should clearly reflect that.
- Conversely, if you assign very high scores in any metric, your summary should explicitly justify why that aspect of the paper is unusually strong relative to other serious work in the field.
- If you find yourself about to make the lower and upper bounds equal to the midpoint, adjust them so they form a non-degenerate interval that honestly reflects your uncertainty. Do not be afraid to use wide credible intervals when you are genuinely uncertain.
"""

PROMPT_OUTPUT = """
Fill both top-level keys:
- `assessment_summary`: about 1000 words.
- `metrics`: object containing all required metrics.

Field names:
- Percentile metrics → `midpoint`, `lower_bound`, `upper_bound`.
- Tier metrics → `score`, `ci_lower`, `ci_upper`.

Return STRICT JSON matching the supplied schema. No preamble. No markdown. No extra text.
"""

The components above are concatenated into a single system prompt string before each API call:

Prompt assembly
SYSTEM_PROMPT_COMBINED = "\n".join([
    PROMPT_ROLE,
    PROMPT_DEBIASING,
    PROMPT_DIAGNOSTIC,
    PROMPT_PERCENTILE_INTRO,
    PROMPT_REFERENCE_GROUP,
    PROMPT_METRICS,
    PROMPT_UNCERTAINTY,
    PROMPT_TIERS,
    PROMPT_VALIDATION,
    PROMPT_OUTPUT,
]).strip()

Submission and collection. The evaluate_paper function uploads a PDF to the API and submits a background job for evaluation. File IDs are cached by path, size, and modification time so that re-running on the same PDF reuses the previously uploaded file.

Evaluation function
def evaluate_paper(pdf_path: Union[str, pathlib.Path],
                   model: Optional[str] = None,
                   use_reasoning: bool = True) -> Dict[str, Any]:
    model = model or MODEL
    fid = get_file_id(pdf_path, client)

    def _payload():
        p = dict(
            model=model,
            text={"format": TEXT_FORMAT_COMBINED},
            input=[
                {"role": "system", "content": [
                    {"type": "input_text", "text": SYSTEM_PROMPT_COMBINED}
                ]},
                {"role": "user", "content": [
                    {"type": "input_file", "file_id": fid},
                    {"type": "input_text", "text": "Return STRICT JSON per schema. No extra text."}
                ]},
            ],
            max_output_tokens=12000,
            background=True,
            store=True,
        )
        if use_reasoning:
            p["reasoning"] = {"effort": "high", "summary": "auto"}
        return p

    kickoff = call_with_retries(lambda: client.responses.create(**_payload()))
    kd = _resp_as_dict(kickoff)
    return {
        "response_id": kd.get("id"),
        "file_id": fid,
        "status": kd.get("status") or "queued",
        "model": model,
        "created_at": kd.get("created_at"),
    }

Each paper’s full PDF is submitted as one background job to the OpenAI Responses API with “high” reasoning effort and server-side JSON-Schema enforcement; we record the response ID, model, file ID, status, and timestamps in a jobs index. No external sources or cross-paper material are retrieved; the evaluation is anchored entirely in the manuscript itself.

Kick off background jobs → results/jobs_index.csv
import pathlib, time

ROOT = pathlib.Path(os.getenv("UJ_PAPERS_DIR", "papers")).expanduser()
# Use RUN_DIR for isolated outputs, but jobs_index stays in results/ for monitoring
IDX  = RESULTS_DIR / "jobs_index.csv"

pdfs = sorted(ROOT.glob("*.pdf"))
print("Found PDFs:", [p.name for p in pdfs])

cols = ["paper","pdf","response_id","file_id","model","status","created_at","last_update","collected","error"]
idx = read_csv_or_empty(IDX, columns=cols)
for c in cols:
    if c not in idx.columns: idx[c] = pd.NA

existing = dict(zip(idx["paper"], idx["status"])) if not idx.empty else {}
started = []

for pdf in pdfs:
    paper = pdf.stem
    if existing.get(paper) in ("queued","in_progress","incomplete","requires_action"):
        print(f"skip {pdf.name}: job already running")
        continue
    try:
        job = evaluate_paper(pdf, model=MODEL, use_reasoning=True)
        started.append({
            "paper": paper,
            "pdf": str(pdf),
            "response_id": job.get("response_id"),
            "file_id": job.get("file_id"),
            "model": job.get("model"),
            "status": job.get("status"),
            "created_at": job.get("created_at") or pd.Timestamp.utcnow().isoformat(),
            "last_update": pd.Timestamp.utcnow().isoformat(),
            "collected": False,
            "error": pd.NA,
        })
        print(f"✓ Started job for {pdf.name}, waiting 90s before next submission...")
        time.sleep(90)  # Wait 90s between submissions to avoid TPM rate limits
    except Exception as e:
        print(f"⚠️ kickoff failed for {pdf.name}: {e}")

if started:
    idx = pd.concat([idx, pd.DataFrame(started)], ignore_index=True)
    idx.drop_duplicates(subset=["paper"], keep="last", inplace=True)
    idx.to_csv(IDX, index=False)
    print(f"Started {len(started)} jobs → {IDX}")
else:
    print("No new jobs started.")

A polling loop then checks each job’s status and, for completed jobs, retrieves the raw JSON response object and writes it to disk alongside reasoning-trace metadata (token counts, reasoning summary).

Poll status, collect completed outputs, save raw JSON only
import json, pathlib, pandas as pd

# Use RUN_DIR for outputs, jobs_index in RESULTS_DIR for monitoring
IDX = RESULTS_DIR / "jobs_index.csv"
JSN = RUN_DIR / "json"; JSN.mkdir(exist_ok=True)

def _safe_read_csv(path, columns):
    p = pathlib.Path(path)
    if not p.exists():
        return pd.DataFrame(columns=columns)
    try:
        df = pd.read_csv(p, dtype={'error': 'object', 'reasoning_id': 'object'})
    except Exception:
        return pd.DataFrame(columns=columns)
    for c in columns:
        if c not in df.columns:
            df[c] = pd.NA
    return df

cols = [
    "paper","pdf","response_id","file_id","model","status",
    "created_at","last_update","collected","error",
    "reasoning_id","input_tokens","output_tokens","reasoning_tokens",
    "reasoning_summary"
]

idx = _safe_read_csv(IDX, cols)

if idx.empty:
    print("Index is empty.")
else:
    term = {"completed","failed","cancelled","expired"}

    # 1) Refresh statuses
    for i, row in idx.iterrows():
        if str(row.get("status")) in term:
            continue
        try:
            r = client.responses.retrieve(str(row["response_id"]))
            d = _resp_as_dict(r)
            idx.at[i, "status"] = d.get("status")
            idx.at[i, "last_update"] = pd.Timestamp.utcnow().isoformat()
            if d.get("status") in term and d.get("status") != "completed":
                idx.at[i, "error"] = json.dumps(d.get("incomplete_details") or {})
        except Exception as e:
            idx.at[i, "error"] = str(e)

    # 2) Collect fresh completed outputs
    newly_done = idx[(idx["status"] == "completed") & (idx["collected"] == False)]
    print(f"Completed and pending collection: {len(newly_done)}")

    for i, row in newly_done.iterrows():
        rid   = str(row["response_id"])
        paper = str(row["paper"])
        try:
            r = client.responses.retrieve(rid)

            # save full raw response JSON
            with open(JSN / f"{paper}.response.json", "w", encoding="utf-8") as f:
                f.write(json.dumps(_resp_as_dict(r), ensure_ascii=False))

            # optional: stash reasoning meta in jobs_index
            m = _reasoning_meta(r)
            idx.at[i, "collected"]         = True
            idx.at[i, "error"]             = pd.NA
            idx.at[i, "reasoning_id"]      = m.get("reasoning_id")
            idx.at[i, "input_tokens"]      = m.get("input_tokens")
            idx.at[i, "output_tokens"]     = m.get("output_tokens")
            idx.at[i, "reasoning_tokens"]  = m.get("reasoning_tokens")
            idx.at[i, "reasoning_summary"] = m.get("reasoning_summary")

        except Exception as e:
            idx.at[i, "error"] = f"collect: {e}"

    # 3) Save updated index and print progress
    idx.to_csv(IDX, index=False)
    counts = idx["status"].value_counts(dropna=False).to_dict()
    print("Status counts:", counts)
    print(f"Progress: {counts.get('completed', 0)}/{len(idx)} completed")

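The saved .response.json files feed the downstream analysis. A minimal, illustrative sketch of turning them into a long-format score table, reusing _get_output_text and _extract_json defined above, is shown here; the layout follows the RUN_DIR/json convention used in the collection step.

Parse collected raw responses into a score table (illustrative sketch)
import json
import pandas as pd

rows = []
for fp in sorted(JSN.glob("*.response.json")):
    raw = json.loads(fp.read_text(encoding="utf-8"))
    # Recover the model's structured output from the raw response object
    parsed = _extract_json(_get_output_text(raw))
    for metric, vals in parsed.get("metrics", {}).items():
        rows.append({
            "paper": fp.stem.replace(".response", ""),
            "metric": metric,
            # percentile metrics use midpoint/bounds; tier metrics use score/ci
            "midpoint": vals.get("midpoint", vals.get("score")),
            "lower":    vals.get("lower_bound", vals.get("ci_lower")),
            "upper":    vals.get("upper_bound", vals.get("ci_upper")),
        })
scores_long = pd.DataFrame(rows)
print(scores_long.head())
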
Multi-model evaluation. To assess whether systematic biases or calibration differences emerge across architectures, we collect evaluations from six models spanning three providers. For OpenAI models that lack background-job or extended-reasoning support (e.g., GPT-4o-mini), we use a synchronous call variant:

OpenAI multi-model evaluation (GPT-4o, etc.)
OPENAI_MODELS = [
    # "gpt-4o-2024-11-20",
    "gpt-4o-mini-2024-07-18",  # cheaper, faster
]

def evaluate_paper_sync(pdf_path: pathlib.Path, model: str) -> Dict[str, Any]:
    """Synchronous evaluation for models without background/reasoning support."""
    fid = get_file_id(pdf_path, client)

    resp = call_with_retries(lambda: client.responses.create(
        model=model,
        text={"format": TEXT_FORMAT_COMBINED},
        input=[
            {"role": "system", "content": [
                {"type": "input_text", "text": SYSTEM_PROMPT_COMBINED}
            ]},
            {"role": "user", "content": [
                {"type": "input_file", "file_id": fid},
                {"type": "input_text", "text": "Return STRICT JSON per schema. No extra text."}
            ]},
        ],
        max_output_tokens=12000,
    ))

    return {
        "response_id": resp.id,
        "model": model,
        "output_text": _get_output_text(resp),
        "usage": _resp_as_dict(resp).get("usage", {}),
    }

def run_openai_models(pdfs: list, models: list = OPENAI_MODELS, out_dir: pathlib.Path = None):
    """Run evaluation across multiple OpenAI models."""
    out_dir = out_dir or RESULTS_DIR  # multi-model outputs go to results/{model}/
    results = []

    for model in models:
        model_dir = out_dir / model.replace("-", "_")
        model_dir.mkdir(parents=True, exist_ok=True)
        json_dir = model_dir / "json"
        json_dir.mkdir(exist_ok=True)

        for pdf in pdfs:
            paper = pdf.stem
            out_file = json_dir / f"{paper}.response.json"
            if out_file.exists():
                print(f"skip {paper} ({model}): exists")
                continue
            try:
                print(f"Evaluating {paper} with {model}...")
                result = evaluate_paper_sync(pdf, model)
                with open(out_file, "w") as f:
                    json.dump(result, f, ensure_ascii=False, indent=2)
                results.append({"paper": paper, "model": model, "status": "completed"})
                time.sleep(2)
            except Exception as e:
                print(f"Error {paper} ({model}): {e}")
                results.append({"paper": paper, "model": model, "status": "failed", "error": str(e)})

    return pd.DataFrame(results)

Anthropic. Anthropic’s API accepts PDFs as base64-encoded document content rather than uploaded file IDs, and all calls are synchronous. We use Claude’s native PDF support to preserve the same multimodal evaluation approach:

Anthropic Claude API evaluation
import anthropic
import base64

ANTHROPIC_KEY_PATH = pathlib.Path("key/anthropic_key.txt")
if ANTHROPIC_KEY_PATH.exists():
    os.environ["ANTHROPIC_API_KEY"] = ANTHROPIC_KEY_PATH.read_text().strip()

anthropic_client = anthropic.Anthropic()

ANTHROPIC_MODELS = [
    # "claude-sonnet-4-20250514",
    "claude-opus-4-6",
    # "claude-3-5-haiku-20241022",  # faster, cheaper
]

def evaluate_paper_anthropic(pdf_path: pathlib.Path, model: str) -> Dict[str, Any]:
    """Evaluate using Anthropic's Claude API with native PDF support."""
    pdf_base64 = base64.standard_b64encode(pdf_path.read_bytes()).decode("utf-8")

    resp = call_with_retries(lambda: anthropic_client.messages.create(
        model=model,
        max_tokens=12000,
        system=SYSTEM_PROMPT_COMBINED,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_base64},
                },
                {"type": "text", "text": "Return STRICT JSON per schema. No extra text."}
            ],
        }],
    ))

    output_text = resp.content[0].text if resp.content else ""
    return {
        "model": model,
        "output_text": output_text,
        "input_tokens": resp.usage.input_tokens,
        "output_tokens": resp.usage.output_tokens,
    }

def run_anthropic_models(pdfs: list, models: list = ANTHROPIC_MODELS, out_dir: pathlib.Path = None):
    """Run evaluation across Anthropic models."""
    out_dir = out_dir or RESULTS_DIR  # multi-model outputs go to results/{model}/
    results = []

    for model in models:
        model_dir = out_dir / model.replace("-", "_")
        model_dir.mkdir(parents=True, exist_ok=True)
        json_dir = model_dir / "json"
        json_dir.mkdir(exist_ok=True)

        for pdf in pdfs:
            paper = pdf.stem
            out_file = json_dir / f"{paper}.response.json"
            if out_file.exists():
                print(f"skip {paper} ({model}): exists")
                continue
            try:
                print(f"Evaluating {paper} with {model}...")
                result = evaluate_paper_anthropic(pdf, model)
                result["parsed"] = _extract_json(result["output_text"])
                with open(out_file, "w") as f:
                    json.dump(result, f, ensure_ascii=False, indent=2)
                results.append({"paper": paper, "model": model, "status": "completed"})
                # Wait based on actual token usage to respect TPM limits
                input_toks = result.get("input_tokens", 100_000)
                ANTHROPIC_TPM = 30_000  # tokens-per-minute limit for Opus 4.6
                wait_secs = max(10, (input_toks / ANTHROPIC_TPM) * 60 + 5)  # +5s safety margin
                print(f"  Rate-limit pause: {wait_secs:.0f}s ({input_toks:,} input tokens, {ANTHROPIC_TPM:,} TPM limit)")
                time.sleep(wait_secs)
            except Exception as e:
                print(f"Error {paper} ({model}): {e}")
                results.append({"paper": paper, "model": model, "status": "failed", "error": str(e)})

    return pd.DataFrame(results)

Google. Google’s Gemini API accepts PDFs via a file-upload endpoint with MIME-type tagging:

Google Gemini API evaluation
import google.generativeai as genai

GOOGLE_KEY_PATH = pathlib.Path("key/google_key.txt")
if GOOGLE_KEY_PATH.exists():
    genai.configure(api_key=GOOGLE_KEY_PATH.read_text().strip())

GOOGLE_MODELS = [
    "gemini-2.0-flash",
    # "gemini-1.5-pro",
]

def evaluate_paper_google(pdf_path: pathlib.Path, model_name: str) -> Dict[str, Any]:
    """Evaluate using Google's Gemini API."""
    uploaded_file = genai.upload_file(pdf_path, mime_type="application/pdf")

    model = genai.GenerativeModel(model_name=model_name, system_instruction=SYSTEM_PROMPT_COMBINED)
    resp = call_with_retries(lambda: model.generate_content(
        [uploaded_file, "Return STRICT JSON per schema. No extra text."],
        generation_config=genai.GenerationConfig(max_output_tokens=12000, response_mime_type="application/json"),
    ))

    try:
        genai.delete_file(uploaded_file.name)
    except Exception:
        pass

    return {
        "model": model_name,
        "output_text": resp.text or "",
        "input_tokens": getattr(resp.usage_metadata, "prompt_token_count", None),
        "output_tokens": getattr(resp.usage_metadata, "candidates_token_count", None),
    }

def run_google_models(pdfs: list, models: list = GOOGLE_MODELS, out_dir: pathlib.Path = None):
    """Run evaluation across Google models."""
    out_dir = out_dir or RESULTS_DIR  # multi-model outputs go to results/{model}/
    results = []

    for model in models:
        model_dir = out_dir / model.replace("-", "_")
        model_dir.mkdir(parents=True, exist_ok=True)
        json_dir = model_dir / "json"
        json_dir.mkdir(exist_ok=True)

        for pdf in pdfs:
            paper = pdf.stem
            out_file = json_dir / f"{paper}.response.json"
            if out_file.exists():
                print(f"skip {paper} ({model}): exists")
                continue
            try:
                print(f"Evaluating {paper} with {model}...")
                result = evaluate_paper_google(pdf, model)
                result["parsed"] = _extract_json(result["output_text"])
                with open(out_file, "w") as f:
                    json.dump(result, f, ensure_ascii=False, indent=2)
                results.append({"paper": paper, "model": model, "status": "completed"})
                time.sleep(2)
            except Exception as e:
                print(f"Error {paper} ({model}): {e}")
                results.append({"paper": paper, "model": model, "status": "failed", "error": str(e)})

    return pd.DataFrame(results)

A unified runner dispatches evaluations across all configured providers and writes per-model output directories:

Run all providers
def run_all_models(pdfs: list = None, out_dir: pathlib.Path = None):
    """Run evaluation across all configured models and providers."""
    pdfs = pdfs or sorted(pathlib.Path("papers").glob("*.pdf"))
    out_dir = out_dir or RESULTS_DIR  # multi-model outputs go to results/{model}/
    all_results = []

    for name, runner in [("openai", run_openai_models), ("anthropic", run_anthropic_models), ("google", run_google_models)]:
        print(f"\n=== {name.upper()} ===")
        try:
            df = runner(pdfs, out_dir=out_dir)
            df["provider"] = name
            all_results.append(df)
        except Exception as e:
            print(f"{name} failed: {e}")

    if all_results:
        combined = pd.concat(all_results, ignore_index=True)
        combined.to_csv(out_dir / "multi_model_summary.csv", index=False)
        return combined
    return pd.DataFrame()


run_all_models()

Focal run with key-issue extraction. For a subset of 14 papers with rich human critiques, we ran an extended evaluation using GPT-5.2 Pro. The schema for this focal run adds a key_issues array—a ranked list of concise issue statements identifying the most important methodological, interpretive, or evidential concerns—alongside the standard metrics. This structured output enables direct comparison between machine-generated and human-identified issues.

Focal run configuration
FOCAL_MODEL = "gpt-5.2-pro-2025-12-11"
FOCAL_RUN_DIR = RESULTS_DIR / "gpt52_pro_focal_jan2026"
FOCAL_RUN_DIR.mkdir(exist_ok=True)
(FOCAL_RUN_DIR / "json").mkdir(exist_ok=True)

FOCAL_PAPERS = [
    "Acemoglu_et_al._2024",
    "Adena_and_Hager_2024",
    "Benabou_et_al._2023",
    "Bilal_and_Kaenzig_2024",
    "Blimpo_and_Castaneda-Dower_2025",
    "Bruers_2021",
    "Clancy_2024",
    "Dullaghan_and_Zhang_2022",
    "Frech_et_al._2023",
    "Green_et_al._2025",
    "McGuire_et_al._2024",
    "Peterman_et_al._2025",
    "Weaver_et_al._2025",
    "Williams_et_al._2024",
]

# Extended schema with key_issues array
COMBINED_SCHEMA_WITH_ISSUES = {
    "type": "object",
    "properties": {
        "assessment_summary": {"type": "string"},
        "key_issues": {
            "type": "array",
            "items": {"type": "string"},
        },
        "metrics": {
            "type": "object",
            "properties": {
                **{m: metric_schema for m in METRICS},
                "tier_should": TIER_METRIC_SCHEMA,
                "tier_will":   TIER_METRIC_SCHEMA,
            },
            "required": METRICS + ["tier_should", "tier_will"],
            "additionalProperties": False,
        },
    },
    "required": ["assessment_summary", "key_issues", "metrics"],
    "additionalProperties": False,
}

TEXT_FORMAT_WITH_ISSUES = {
    "type": "json_schema",
    "name": "paper_assessment_with_key_issues_v1",
    "strict": True,
    "schema": COMBINED_SCHEMA_WITH_ISSUES,
}

# Extended output prompt with key_issues instruction
PROMPT_OUTPUT_WITH_ISSUES = """
Fill all three top-level keys:
- `assessment_summary`: about 1000 words.
- `key_issues`: a numbered list (array of strings) identifying the most important methodological, interpretive, or evidential issues in the paper. Each item should be a concise statement (1-2 sentences) that a reader could use as a checklist. Aim for 5-15 issues depending on the paper. Order from most to least important.
- `metrics`: object containing all required metrics.

Field names:
- Percentile metrics → `midpoint`, `lower_bound`, `upper_bound`.
- Tier metrics → `score`, `ci_lower`, `ci_upper`.

Return STRICT JSON matching the supplied schema. No preamble. No markdown. No extra text.
"""

SYSTEM_PROMPT_WITH_ISSUES = "\n".join([
    PROMPT_ROLE,
    PROMPT_DEBIASING,
    PROMPT_DIAGNOSTIC,
    PROMPT_PERCENTILE_INTRO,
    PROMPT_REFERENCE_GROUP,
    PROMPT_METRICS,
    PROMPT_UNCERTAINTY,
    PROMPT_TIERS,
    PROMPT_VALIDATION,
    PROMPT_OUTPUT_WITH_ISSUES,
]).strip()

Kick off focal paper jobs → gpt52_pro_focal_jan2026/jobs_index.csv
import time

FOCAL_IDX = FOCAL_RUN_DIR / "jobs_index.csv"
cols = ["paper","pdf","response_id","file_id","model","status","created_at","last_update","collected","error"]
idx = read_csv_or_empty(FOCAL_IDX, columns=cols)

existing = dict(zip(idx["paper"], idx["status"])) if not idx.empty else {}
started = []

for paper_name in FOCAL_PAPERS:
    pdf_path = pathlib.Path("papers") / f"{paper_name}.pdf"

    if not pdf_path.exists():
        print(f"⚠️ PDF not found: {pdf_path}")
        continue

    if existing.get(paper_name) in ("queued", "in_progress", "incomplete"):
        print(f"⏭️ Skip {paper_name}: job already running")
        continue

    if existing.get(paper_name) == "completed":
        print(f"✅ Skip {paper_name}: already completed")
        continue

    try:
        fid = get_file_id(pdf_path, client)

        kickoff = call_with_retries(lambda: client.responses.create(
            model=FOCAL_MODEL,
            text={"format": TEXT_FORMAT_WITH_ISSUES},
            input=[
                {"role": "system", "content": [
                    {"type": "input_text", "text": SYSTEM_PROMPT_WITH_ISSUES}
                ]},
                {"role": "user", "content": [
                    {"type": "input_file", "file_id": fid},
                    {"type": "input_text", "text": "Return STRICT JSON per schema. No extra text."}
                ]},
            ],
            max_output_tokens=15000,
            background=True,
            store=True,
            reasoning={"effort": "high", "summary": "detailed"},
        ))
        kd = _resp_as_dict(kickoff)

        started.append({
            "paper": paper_name,
            "pdf": str(pdf_path),
            "response_id": kd.get("id"),
            "file_id": fid,
            "model": FOCAL_MODEL,
            "status": kd.get("status") or "queued",
            "created_at": kd.get("created_at") or pd.Timestamp.utcnow().isoformat(),
            "last_update": pd.Timestamp.utcnow().isoformat(),
            "collected": False,
            "error": pd.NA,
        })
        print(f"✓ Started job for {paper_name}")
        time.sleep(90)  # Wait between submissions

    except Exception as e:
        print(f"❌ Failed {paper_name}: {e}")

if started:
    idx = pd.concat([idx, pd.DataFrame(started)], ignore_index=True)
    idx.drop_duplicates(subset=["paper"], keep="last", inplace=True)
    idx.to_csv(FOCAL_IDX, index=False)
    print(f"\n✓ Started {len(started)} jobs → {FOCAL_IDX}")
else:
    print("No new jobs started.")

Collect focal paper results
FOCAL_IDX = FOCAL_RUN_DIR / "jobs_index.csv"
FOCAL_JSON = FOCAL_RUN_DIR / "json"

idx = pd.read_csv(FOCAL_IDX, dtype={'error': 'object'})
print(f"Polling {len(idx)} focal jobs...")

for i, row in idx.iterrows():
    paper = row["paper"]
    resp_id = row["response_id"]

    if pd.isna(resp_id):
        continue

    json_path = FOCAL_JSON / f"{paper}.response.json"
    if json_path.exists() and row.get("collected") == True:
        continue

    try:
        resp = client.responses.retrieve(resp_id)
        rd = _resp_as_dict(resp)
        status = rd.get("status", "unknown")
        idx.at[i, "status"] = status
        idx.at[i, "last_update"] = pd.Timestamp.utcnow().isoformat()

        if status == "completed":
            with open(json_path, "w") as f:
                json.dump(rd, f, ensure_ascii=False, indent=2, default=str)
            idx.at[i, "collected"] = True

            m = _reasoning_meta(resp)
            idx.at[i, "input_tokens"] = m.get("input_tokens")
            idx.at[i, "output_tokens"] = m.get("output_tokens")
            idx.at[i, "reasoning_tokens"] = m.get("reasoning_tokens")
            print(f"✓ Collected: {paper}")

        elif status == "failed":
            idx.at[i, "error"] = rd.get("error", "Unknown")
            print(f"✗ Failed: {paper}")

        else:
            print(f"⏳ {status}: {paper}")

    except Exception as e:
        print(f"⚠️ Error {paper}: {e}")

idx.to_csv(FOCAL_IDX, index=False)
counts = idx["status"].value_counts().to_dict()
print(f"\nStatus: {counts}")

Key-issue comparison with human critiques. To validate how well the LLM identifies substantive concerns, we compare its key_issues output against human expert critiques drawn from The Unjournal’s Coda database. These human critiques—produced by paid domain experts and synthesized by evaluation managers—provide a high-quality but noisy reference standard for issue identification; accordingly, we treat them as a comparative signal rather than ground truth.

The comparison proceeds in two stages. First, we parse and align the data sources: LLM key issues are extracted from the focal-run JSON responses, while human critiques are drawn from a manually curated markdown document pairing each paper’s machine and expert assessments. Second, we use an LLM judge (GPT-5.2 Pro with schema-enforced structured output) to systematically assess the degree of alignment between each human-identified issue and the set of machine-generated issues, producing issue-by-issue match scores, coverage estimates (fraction of human issues captured), and precision estimates (fraction of LLM issues that are genuinely substantive).

Key issues comparison: parse markdown and run LLM assessment
import re
import time

# =============================================================================
# Configuration
# =============================================================================
KEY_ISSUES_MD_INPUT = RESULTS_DIR / "key_issues_comparison.md"
KEY_ISSUES_OUTPUT = RESULTS_DIR / "key_issues_comparison.json"

# =============================================================================
# Markdown Parser
# =============================================================================
def parse_key_issues_markdown(md_path):
    """Parse key_issues_comparison.md to extract paper data.

    The markdown has this structure for each paper:
    ## PaperName
    **Coda title:** Title
    ### GPT-5.2 Pro Key Issues
    - Issue 1
    - Issue 2
    ### Human Expert Critiques (Coda)
    Critique text...
    ---
    """
    with open(md_path, 'r', encoding='utf-8') as f:
        content = f.read()

    # Split by paper sections (## followed by paper name)
    # Pattern handles names like "Blimpo_and_Castaneda-Dower_2025" (with hyphens)
    paper_sections = re.split(r'\n## ([A-Za-z_0-9-]+(?:_et_al\.)?_\d{4})\n', content)

    matched_data = []
    for i in range(1, len(paper_sections), 2):
        if i + 1 >= len(paper_sections):
            break

        paper_name = paper_sections[i].strip()
        section_content = paper_sections[i + 1]

        # Extract Coda title
        coda_match = re.search(r'\*\*Coda title:\*\*\s*(.+?)(?:\n|$)', section_content)
        coda_title = coda_match.group(1).strip() if coda_match else ""

        # Extract GPT key issues (bullet points after "### GPT" header)
        gpt_section = re.search(
            r'### GPT[^\n]*Key Issues\s*\n(.*?)(?=\n### Human|$)',
            section_content,
            re.DOTALL
        )
        gpt_issues = []
        if gpt_section:
            bullets = re.findall(r'^- (.+)$', gpt_section.group(1), re.MULTILINE)
            gpt_issues = [b.strip() for b in bullets if b.strip()]

        # Extract human critiques (everything after "### Human Expert Critiques")
        human_section = re.search(
            r'### Human Expert Critiques[^\n]*\n(.*?)(?=\n---|\Z)',
            section_content,
            re.DOTALL
        )
        human_critique = human_section.group(1).strip() if human_section else ""

        if gpt_issues or human_critique:
            matched_data.append({
                "gpt_paper": paper_name,
                "coda_title": coda_title,
                "gpt_key_issues": gpt_issues,
                "coda_critique": human_critique,
                "num_gpt_issues": len(gpt_issues),
                "coda_critique_length": len(human_critique),
            })

    return matched_data

# =============================================================================
# LLM Comparison Function - with EXPLICIT issue-to-issue matching
# =============================================================================

# JSON Schema to ENFORCE the matched_pairs output structure
# Per OpenAI Responses API: text.format needs type/name/strict/schema at TOP LEVEL

# 1) Inner JSON Schema (the actual schema definition)
ISSUE_COMPARISON_SCHEMA = {
    "type": "object",
    "properties": {
        "matched_pairs": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "human_issue_index": {"type": "integer", "minimum": 1},
                    "human_issue_text": {"type": "string"},
                    "llm_issue_indices": {"type": "array", "items": {"type": "integer", "minimum": 1}},
                    "match_quality": {"type": "integer", "minimum": 0, "maximum": 100},
                    "label": {"type": "string"},
                    "match_explanation": {"type": "string"},
                    "detailed_discussion": {"type": "string"},
                },
                "required": ["human_issue_index", "human_issue_text", "llm_issue_indices", "match_quality", "label", "match_explanation", "detailed_discussion"],
                "additionalProperties": False,
            },
        },
        "unmatched_human": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "index": {"type": "integer", "minimum": 1},
                    "brief_description": {"type": "string"},
                    "why_missed": {"type": "string"},
                },
                "required": ["index", "brief_description", "why_missed"],
                "additionalProperties": False,
            },
        },
        "unmatched_llm": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "index": {"type": "integer", "minimum": 1},
                    "brief_description": {"type": "string"},
                    "why_extra": {"type": "string"},
                },
                "required": ["index", "brief_description", "why_extra"],
                "additionalProperties": False,
            },
        },
        "coverage_pct": {"type": "integer", "minimum": 0, "maximum": 100},
        "precision_pct": {"type": "integer", "minimum": 0, "maximum": 100},
        "overall_rating": {"type": "string", "enum": ["Excellent", "Good", "Moderate", "Poor"]},
        "overall_justification": {"type": "string"},
        "detailed_notes": {"type": "string"},
    },
    "required": ["matched_pairs", "unmatched_human", "unmatched_llm", "coverage_pct", "precision_pct", "overall_rating", "overall_justification", "detailed_notes"],
    "additionalProperties": False,
}

# 2) Responses API text.format wrapper (THIS is what goes into text={"format": ...})
COMPARISON_TEXT_FORMAT = {
    "type": "json_schema",
    "name": "issue_comparison",
    "strict": True,
    "schema": ISSUE_COMPARISON_SCHEMA,
}

COMPARISON_PROMPT_TEMPLATE = """You are comparing human expert critiques with LLM-identified issues for a research paper evaluation.

## Task
Create a detailed issue-by-issue comparison. For each human issue, identify which LLM issue(s) cover it.

## Human Expert Issues (numbered H1, H2, ...)
{human_critique}

## LLM Issues (numbered L1, L2, ...)
{gpt_issues}

## Instructions
1. For each human issue, identify which LLM issue(s) address the same or related concern
2. Create a matched_pairs entry for each human issue that has any LLM coverage, with:
   - A short descriptive LABEL for the shared concern (5-10 words)
   - match_quality: 0-100% where 100% = exact same concern
   - match_explanation: 1-2 sentence explanation of WHY this is a match
   - detailed_discussion: 3-5 sentences comparing how human vs LLM framed the issue
3. List unmatched_human issues (not captured by LLM) with brief_description and why_missed
4. List unmatched_llm issues (not in human critique) with brief_description and why_extra
5. Calculate coverage_pct (% of human issues with match_quality >= 30) and precision_pct (% of LLM issues that match)

Be precise about which issues match. Related but distinct concerns should have lower match_quality scores (40-60%) with explanation of the distinction."""

def compare_issues_with_llm(paper_name, coda_critique, gpt_issues):
    """Use LLM to compare the critiques with explicit issue-to-issue matching."""
    # Input validation
    if not gpt_issues:
        return {"error": "No GPT issues", "matched_pairs": [], "unmatched_human": [], "unmatched_llm": [],
                "coverage_pct": None, "precision_pct": None, "overall_rating": "N/A",
                "overall_justification": "", "detailed_notes": ""}
    if not coda_critique or len(coda_critique.strip()) < 20:
        return {"error": "No human critique", "matched_pairs": [], "unmatched_human": [], "unmatched_llm": [],
                "coverage_pct": None, "precision_pct": None, "overall_rating": "N/A",
                "overall_justification": "", "detailed_notes": ""}

    # Number the LLM issues (L1, L2, ...). The human critique is passed as free text;
    # the prompt refers to numbered issues (H1, H2, ...), so the judge infers that
    # segmentation from the critique itself.
    gpt_issues_text = "\n".join(f"L{i+1}: {issue}" for i, issue in enumerate(gpt_issues))
    prompt_text = COMPARISON_PROMPT_TEMPLATE.format(
        human_critique=coda_critique,
        gpt_issues=gpt_issues_text
    )

    # Call LLM using Responses API with SCHEMA ENFORCEMENT
    response = None
    error_msg = None
    try:
        response = call_with_retries(lambda: client.responses.create(
            model=FOCAL_MODEL,
            text={"format": COMPARISON_TEXT_FORMAT},  # Schema enforces matched_pairs structure
            input=[
                {"role": "user", "content": [
                    {"type": "input_text", "text": prompt_text}
                ]}
            ],
            reasoning={"effort": "medium", "summary": "auto"},
            max_output_tokens=6000,  # Increased for detailed responses
        ))
    except Exception as exc:
        error_msg = str(exc)
        print(f"  LLM error: {error_msg}")

    # Parse response - extract output text from responses API format
    if response is not None:
        try:
            output_text = None
            for block in response.output:
                if block.type == "message":
                    for content in block.content:
                        if content.type == "output_text":
                            output_text = content.text
                            break
            if output_text:
                return json.loads(output_text)
            else:
                error_msg = "No output text in response"
                print(f"  {error_msg}")
        except Exception as parse_exc:
            error_msg = f"Parse error: {parse_exc}"
            print(f"  {error_msg}")

    return {"error": error_msg or "Unknown error", "matched_pairs": [], "unmatched_human": [], "unmatched_llm": [],
            "coverage_pct": None, "precision_pct": None, "overall_rating": "Error",
            "overall_justification": "", "detailed_notes": ""}

# =============================================================================
# Step 1: Parse markdown and save JSON
# =============================================================================
if not KEY_ISSUES_MD_INPUT.exists():
    raise FileNotFoundError(f"Input markdown not found: {KEY_ISSUES_MD_INPUT}")

matched_data = parse_key_issues_markdown(KEY_ISSUES_MD_INPUT)

print(f"Parsed {len(matched_data)} papers from {KEY_ISSUES_MD_INPUT.name}")
for item in matched_data:
    print(f"  - {item['gpt_paper']}: {item['num_gpt_issues']} GPT issues, {item['coda_critique_length']} chars human critique")

KEY_ISSUES_OUTPUT.parent.mkdir(parents=True, exist_ok=True)
with open(KEY_ISSUES_OUTPUT, 'w', encoding='utf-8') as f:
    json.dump(matched_data, f, indent=2, ensure_ascii=False)
print(f"\nJSON saved to: {KEY_ISSUES_OUTPUT}")

# =============================================================================
# Step 2: Run LLM comparison on each paper
# =============================================================================
print("\n" + "="*60)
print("Running LLM comparison...")
comparison_results = []

for item in matched_data:
    paper = item['gpt_paper']
    print(f"\nComparing: {paper}")

    comparison = compare_issues_with_llm(
        paper, item['coda_critique'], item['gpt_key_issues']
    )

    comparison_results.append({
        **item,
        "comparison": comparison
    })

    if comparison.get('coverage_pct') is not None:
        n_matched = len(comparison.get('matched_pairs', []))
        n_unmatched_h = len(comparison.get('unmatched_human', []))
        n_unmatched_l = len(comparison.get('unmatched_llm', []))
        print(f"  Coverage: {comparison['coverage_pct']}%, Precision: {comparison['precision_pct']}%, Rating: {comparison['overall_rating']}")
        print(f"  Matched pairs: {n_matched}, Unmatched human: {n_unmatched_h}, Unmatched LLM: {n_unmatched_l}")
    else:
        print(f"  Skipped or error: {comparison.get('error', 'unknown')}")

    time.sleep(2)  # Rate limiting

# Save results with comparison
comparison_output = KEY_ISSUES_OUTPUT.with_name('key_issues_comparison_results.json')
with open(comparison_output, 'w', encoding='utf-8') as f:
    json.dump(comparison_results, f, indent=2, ensure_ascii=False)
print(f"\nComparison results saved to: {comparison_output}")

# =============================================================================
# Step 3: Summary statistics
# =============================================================================
valid_results = [r for r in comparison_results if r['comparison'].get('coverage_pct') is not None]
if valid_results:
    avg_coverage = sum(r['comparison']['coverage_pct'] for r in valid_results) / len(valid_results)
    avg_precision = sum(r['comparison']['precision_pct'] for r in valid_results) / len(valid_results)
    ratings = [r['comparison']['overall_rating'] for r in valid_results]

    print(f"\n{'='*60}")
    print(f"SUMMARY ({len(valid_results)}/{len(comparison_results)} papers with valid comparisons)")
    print(f"Average Coverage: {avg_coverage:.1f}%")
    print(f"Average Precision: {avg_precision:.1f}%")
    print(f"Rating distribution: {dict((r, ratings.count(r)) for r in set(ratings))}")
else:
    print("\nNo valid comparison results to summarize.")

The comparison pipeline takes a manually curated markdown file pairing each paper’s LLM issues with human critiques and produces two structured outputs: a parsed JSON dataset (key_issues_comparison.json) and an LLM-assessed alignment report (key_issues_comparison_results.json) containing per-paper coverage, precision, matched-pair explanations, and an overall quality rating. Together, these outputs feed the qualitative analysis presented in Appendix B: Critiques & Key Issues.
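
As a sanity check on the judge’s self-reported percentages, coverage and precision can also be recomputed directly from the structured matched_pairs output. The sketch below is illustrative rather than part of the pipeline: it assumes the results file sits at the hypothetical path shown (in the notebook this would be KEY_ISSUES_OUTPUT.with_name("key_issues_comparison_results.json")) and uses only the field names defined in the schema above.
Recompute coverage and precision from saved comparison results (illustrative)
import json
from pathlib import Path

# Hypothetical path; adjust to the RESULTS_DIR used above
results_path = Path("results/key_issues_comparison_results.json")
records = json.loads(results_path.read_text(encoding="utf-8"))

for rec in records:
    comp = rec["comparison"]
    pairs = comp.get("matched_pairs", [])
    n_human = len(pairs) + len(comp.get("unmatched_human", []))
    n_llm = rec.get("num_gpt_issues", 0)
    if comp.get("coverage_pct") is None or n_human == 0 or n_llm == 0:
        continue  # skip errored or degenerate papers

    # Coverage: human issues whose match was scored >= 30 by the judge
    covered = sum(1 for p in pairs if p["match_quality"] >= 30)
    # Precision: distinct LLM issues cited in at least one matched pair
    matched_llm = {i for p in pairs for i in p["llm_issue_indices"]}

    print(f"{rec['gpt_paper']}: "
          f"coverage {100 * covered / n_human:.0f}% (judge: {comp['coverage_pct']}%), "
          f"precision {100 * len(matched_llm) / n_llm:.0f}% (judge: {comp['precision_pct']}%)")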


  1. The seven criteria are: overall assessment, claims and evidence, methods, advancing knowledge and practice, logic and communication, open and collaborative science, and relevance to global priorities. Full definitions are given in The Unjournal’s guidelines for evaluators.↩︎