Validating Your LLM Judge on the PrecepTron Benchmark

This vignette walks through running the full PrecepTron benchmark to measure how well an LLM judge agrees with physician scores across all clinical tasks. The benchmark includes data from Goh et al.¹², Brodeur et al.³, Buckley et al.⁴⁵, Cabral et al.⁶, and Kanjee et al.⁷.

Overview

The benchmark scores every entry in the PrecepTron dataset using your chosen LLM judge, then compares its scores against the ground-truth physician grades. This tells you whether you can trust a particular model as an automated scorer for clinical reasoning.

1. Install

git clone https://github.com/2v/Preceptron3.git
cd Preceptron3
pip install -e ".[anthropic]"
pip install pyyaml

2. Configure API Keys

Create benchmarking/config.ini with credentials for the providers you plan to use:

[api_keys]
openai = sk-...
openrouter = sk-or-...
anthropic = sk-ant-...

[azure]
api_key = ...
endpoint = https://your-endpoint.openai.azure.com
api_version = 2024-12-01-preview

You only need keys for the providers you will actually use.
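Before launching a long run, it can help to confirm which providers actually have a credential filled in. The following is a minimal sketch using Python's standard configparser; the helper name configured_providers is hypothetical and is not part of PrecepTron, and the benchmark itself may read the file differently.

```python
import configparser


def configured_providers(ini_text: str) -> list[str]:
    """Return the providers that have a non-empty credential in the INI text.

    Hypothetical helper for illustration; not part of the PrecepTron package.
    """
    # interpolation=None so a literal '%' in a key is not treated specially.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)

    found = []
    if parser.has_section("api_keys"):
        for provider, key in parser.items("api_keys"):
            if key.strip():
                found.append(provider)
    # Azure credentials live in their own section in the layout shown above.
    if parser.has_section("azure") and parser.get("azure", "api_key", fallback="").strip():
        found.append("azure")
    return found


# Placeholder values only; substitute the contents of benchmarking/config.ini.
sample = """
[api_keys]
openai = sk-example
anthropic =

[azure]
api_key = example
endpoint = https://your-endpoint.openai.azure.com
api_version = 2024-12-01-preview
"""
print(configured_providers(sample))
```

To check a real file, pass the result of open("benchmarking/config.ini").read() instead of the sample string.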

3. Configure the Benchmark Run

Edit benchmarking/config.yaml (or copy from config_example.yaml) to select models, tasks, and concurrency:

dataset_path: "../score_data/combined_dataset.json"
output_dir: "../overall_judge_trials"

tasks:
  - management_reasoning
  - diagnostic_reasoning
  - r_idea
  - cpc_bond
  - cpc_management

threads:
  openai: 30
  openrouter: 20
  anthropic: 20

models:
  - name: "gpt-4o"
    provider: "openai"
    model_id: "gpt-4o"
  - name: "claude-sonnet"
    provider: "anthropic"
    model_id: "claude-sonnet-4-20250514"

Each model entry needs:

name -- used for output filenames
provider -- one of openai, openrouter, anthropic, azure
model_id -- the model identifier to pass to the API

Leave tasks empty or omit it to run all tasks.
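A quick sanity check of the model entries can catch typos before any API calls are made. The sketch below assumes the configuration has already been loaded into a dictionary (for example with yaml.safe_load from PyYAML); check_config is a hypothetical helper, not something the benchmark provides, and the rules it enforces are taken only from the field descriptions above.

```python
ALLOWED_PROVIDERS = {"openai", "openrouter", "anthropic", "azure"}
REQUIRED_MODEL_FIELDS = ("name", "provider", "model_id")


def check_config(cfg: dict) -> list[str]:
    """Return a list of problems found in the config; empty means none found.

    Hypothetical helper for illustration; not part of the PrecepTron package.
    """
    problems = []
    models = cfg.get("models") or []
    if not models:
        problems.append("no models listed")

    seen_names = set()
    for i, model in enumerate(models):
        for field in REQUIRED_MODEL_FIELDS:
            if not model.get(field):
                problems.append(f"models[{i}] is missing '{field}'")
        provider = model.get("provider")
        if provider and provider not in ALLOWED_PROVIDERS:
            problems.append(f"models[{i}] has unknown provider '{provider}'")
        # Names become output filenames, so duplicates would collide.
        name = model.get("name")
        if name and name in seen_names:
            problems.append(f"duplicate model name '{name}'")
        if name:
            seen_names.add(name)
    return problems


# Mirrors the example configuration above.
example = {
    "tasks": ["cpc_bond"],
    "models": [
        {"name": "gpt-4o", "provider": "openai", "model_id": "gpt-4o"},
        {"name": "claude-sonnet", "provider": "anthropic",
         "model_id": "claude-sonnet-4-20250514"},
    ],
}
print(check_config(example))
```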

4. Run the Benchmark

cd benchmarking
python run_benchmark.py --config config.yaml --keys config.ini

Results are saved to output_dir as {model_name}_{task_name}.json, where model_name is the name field of the model entry. The runner resumes automatically: if a run is interrupted, re-running the same command skips entries that have already been completed.
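Given the filename pattern above, one way to see which model and task combinations have not produced an output file yet is to enumerate the expected paths. This is a sketch only: missing_outputs is a hypothetical helper, and an existing file does not by itself guarantee that every entry in it was scored, since an interrupted run may leave a partial file.

```python
from pathlib import Path


def missing_outputs(output_dir, model_names, task_names):
    """Return (model, task) pairs with no {model}_{task}.json in output_dir.

    Hypothetical helper for illustration; not part of the PrecepTron package.
    """
    out = Path(output_dir)
    return [
        (model, task)
        for model in model_names
        for task in task_names
        if not (out / f"{model}_{task}.json").exists()
    ]


# Names taken from the example configuration; adjust to match your config.yaml.
pending = missing_outputs(
    "../overall_judge_trials",
    ["gpt-4o", "claude-sonnet"],
    ["cpc_bond", "r_idea"],
)
for model, task in pending:
    print(f"no output yet: {model} / {task}")
```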

5. Analyze Results

Each output file contains the original dataset entries augmented with the LLM judge's prediction. You can compute agreement metrics (e.g. Cohen's kappa, Pearson correlation, MAE) by comparing the grade field (physician score) against the judge's predicted score.

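import json
import numpy as np

# Load one output file, named {model_name}_{task_name}.json inside output_dir.
with open("../overall_judge_trials/gpt-4o_cpc_bond.json") as f:
    trials = json.load(f)

# Physician grade (ground truth) and the judge's predicted score per entry.
physician = np.array([t["grade"] for t in trials])
predicted = np.array([t["predicted_score"] for t in trials])

# Pearson correlation and mean absolute error between the two score vectors.
correlation = np.corrcoef(physician, predicted)[0, 1]
mae = np.mean(np.abs(physician - predicted))

print(f"Pearson r: {correlation:.3f}")
print(f"MAE: {mae:.2f}")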
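Cohen's kappa is mentioned above but not computed in the snippet. The following is a minimal standard-library sketch of unweighted Cohen's kappa that treats each score as a categorical label; cohen_kappa is a hypothetical helper written for this example. For ordinal rubric scores a weighted variant may be more appropriate, and the choice should follow how the task's rubric is defined.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length sequences of labels.

    Hypothetical helper for illustration; scores are treated as categories.
    """
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("both raters need the same non-zero number of scores")

    # Observed agreement: fraction of entries where the two scores match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


# Made-up scores for illustration only; replace with the physician grades
# and judge predictions loaded from an output file.
example_physician = [1, 2, 3, 1]
example_predicted = [1, 2, 3, 2]
print(f"Cohen's kappa: {cohen_kappa(example_physician, example_predicted):.3f}")
```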

Tips

Start small. Test on a single task (e.g. cpc_bond) with one model before running the full matrix.
Compare models. List several entries under models, each with its own name and model_id, to see which LLM judge aligns best with physicians. Because name determines the output filename, keep it unique per model.
Tune threads. Increase thread counts for higher throughput, but be mindful of API rate limits.
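To compare judges on one task, the per-model output files can be summarized in a single pass. The sketch below assumes the {model_name}_{task_name}.json naming and the grade and predicted_score fields described earlier; mae_by_model is a hypothetical helper, and the demonstration at the end uses synthetic entries rather than real benchmark results.

```python
import json
import tempfile
from pathlib import Path


def mae_by_model(output_dir, task_name):
    """Return {model_name: mean absolute error} for one task.

    Hypothetical helper for illustration; not part of the PrecepTron package.
    """
    suffix = f"_{task_name}.json"
    results = {}
    for path in sorted(Path(output_dir).glob(f"*{suffix}")):
        model_name = path.name[: -len(suffix)]
        with path.open() as f:
            trials = json.load(f)
        # Skip entries that lack either score (for example, a partial run).
        pairs = [
            (t["grade"], t["predicted_score"])
            for t in trials
            if t.get("grade") is not None and t.get("predicted_score") is not None
        ]
        if pairs:
            results[model_name] = sum(abs(g - p) for g, p in pairs) / len(pairs)
    return results


# Demonstration with synthetic entries, for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    fake_trials = [
        {"grade": 3, "predicted_score": 4},
        {"grade": 5, "predicted_score": 5},
    ]
    (Path(tmp) / "model-a_cpc_bond.json").write_text(json.dumps(fake_trials))
    demo = mae_by_model(tmp, "cpc_bond")

for model, mae in demo.items():
    print(f"{model}: MAE {mae:.2f}")
```

For a real comparison, call mae_by_model with the output_dir from config.yaml and one of the configured task names.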

References

1. Goh E, et al. GPT-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial. Nature Medicine. 2025. doi:10.1038/s41591-024-03456-y
2. Goh E, et al. Large language model influence on diagnostic reasoning: a randomized clinical trial. JAMA Network Open. 2024. doi:10.1001/jamanetworkopen.2024.40969
3. Brodeur P, et al. Superhuman performance of a large language model on the reasoning tasks of a physician. arXiv. 2025. arXiv:2412.10849
4. Buckley T, et al. Multimodal foundation models exploit text to make medical image predictions. arXiv. 2024. arXiv:2311.05591
5. Buckley T, et al. Comparison of frontier open-source and proprietary large language models for complex diagnoses. JAMA Health Forum. 2025. doi:10.1001/jamahealthforum.2025.0040
6. Cabral S, et al. Clinical reasoning of a generative artificial intelligence model compared with physicians. JAMA Internal Medicine. 2024. doi:10.1001/jamainternmed.2024.0295
7. Kanjee Z, et al. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA. 2023. doi:10.1001/jama.2023.8288