Validating Your LLM Judge on the PrecepTron Benchmark¶
This vignette walks through running the full PrecepTron benchmark to measure how well an LLM judge agrees with physician scores across all clinical tasks. The benchmark includes data from Goh et al. [1, 2], Brodeur et al. [3, 4], Buckley et al. [5, 6], Cabral et al. [7], and Kanjee et al. [8].
Overview¶
The benchmark scores every entry in the PrecepTron dataset using your chosen LLM judge, then compares its scores against the ground-truth physician grades. This tells you whether you can trust a particular model as an automated scorer for clinical reasoning.
1. Install¶
git clone https://github.com/2v/Preceptron3.git
cd Preceptron3
pip install -e ".[anthropic]"
pip install pyyaml
2. Configure API Keys¶
Create benchmarking/config.ini with credentials for the providers you plan to use:
[api_keys]
openai = sk-...
openrouter = sk-or-...
anthropic = sk-ant-...
[azure]
api_key = ...
endpoint = https://your-endpoint.openai.azure.com
api_version = 2024-12-01-preview
You only need keys for the providers you will actually use.
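As a quick sanity check before launching a run, you can confirm that the keys you expect are actually being picked up. The snippet below is a minimal sketch that assumes the config.ini layout shown above; it is not part of the benchmark code itself.

import configparser

# Read the credentials file created above (path relative to the repo root)
config = configparser.ConfigParser()
config.read("benchmarking/config.ini")

# Report which providers have a key configured
for provider in ("openai", "openrouter", "anthropic"):
    value = config.get("api_keys", provider, fallback="") if config.has_section("api_keys") else ""
    print(f"{provider}: {'configured' if value else 'missing'}")

# Azure is usable only if all three fields are present
if config.has_section("azure"):
    fields = ("api_key", "endpoint", "api_version")
    complete = all(config.get("azure", k, fallback="") for k in fields)
    print("azure:", "configured" if complete else "incomplete")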
3. Configure the Benchmark Run¶
Edit benchmarking/config.yaml (or copy from config_example.yaml) to select models, tasks, and concurrency:
dataset_path: "../score_data/combined_dataset.json"
output_dir: "../overall_judge_trials"

tasks:
  - management_reasoning
  - diagnostic_reasoning
  - r_idea
  - cpc_bond
  - cpc_management

threads:
  openai: 30
  openrouter: 20
  anthropic: 20

models:
  - name: "gpt-4o"
    provider: "openai"
    model_id: "gpt-4o"
  - name: "claude-sonnet"
    provider: "anthropic"
    model_id: "claude-sonnet-4-20250514"
Each model entry needs:
- `name` -- used for output filenames
- `provider` -- one of `openai`, `openrouter`, `anthropic`, `azure`
- `model_id` -- the model identifier to pass to the API
Leave `tasks` empty or omit it to run all tasks.
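Since pyyaml is installed in step 1, you can sanity-check the configuration before committing to a long run. This is a minimal sketch under the assumption that the file follows the structure shown above; the benchmark runner does its own validation.

import yaml

# Load the run configuration and check that each model entry is complete
with open("benchmarking/config.yaml") as f:
    cfg = yaml.safe_load(f)

valid_providers = {"openai", "openrouter", "anthropic", "azure"}
for model in cfg.get("models", []):
    missing = {"name", "provider", "model_id"} - set(model)
    if missing:
        print(f"{model.get('name', '?')}: missing {sorted(missing)}")
    elif model["provider"] not in valid_providers:
        print(f"{model['name']}: unknown provider {model['provider']!r}")

# An empty or absent tasks list means "run everything"
print("tasks:", cfg.get("tasks") or "all")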
4. Run the Benchmark¶
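The entry point below is illustrative only -- the script name is an assumption, so check the repository README or the benchmarking/ directory for the actual command:

# Hypothetical entry point -- substitute the actual runner script from the repo
cd benchmarking
python run_benchmark.py   # assumed to read config.yaml from the same directory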
Results are saved to output_dir as {model_name}_{task_name}.json. The runner supports automatic resume -- if a run is interrupted, re-running the same command skips already-completed entries.
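To see how far an interrupted run got, you can list the output files that already exist, using the naming convention above (a small sketch, assuming the output_dir from the example config):

from pathlib import Path

# List which {model_name}_{task_name}.json outputs have been written so far
output_dir = Path("../overall_judge_trials")
for path in sorted(output_dir.glob("*.json")):
    print(path.stem)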
5. Analyze Results¶
Each output file contains the original dataset entries augmented with the LLM judge's prediction. You can compute agreement metrics (e.g. Cohen's kappa, Pearson correlation, MAE) by comparing the grade field (physician score) against the judge's predicted score.
import json
import numpy as np

# Load one model/task output file produced by the benchmark runner
with open("../overall_judge_trials/gpt-4o_cpc_bond.json") as f:
    trials = json.load(f)

# Physician ground-truth grades vs. the LLM judge's predictions
physician = np.array([t["grade"] for t in trials])
predicted = np.array([t["predicted_score"] for t in trials])

correlation = np.corrcoef(physician, predicted)[0, 1]
mae = np.mean(np.abs(physician - predicted))

print(f"Pearson r: {correlation:.3f}")
print(f"MAE: {mae:.2f}")
Tips¶
- Start small. Test on a single task (e.g. `cpc_bond`) with one model before running the full matrix; a minimal config for this is sketched after this list.
- Compare models. Run the same config with different `model_id` values to see which LLM judge aligns best with physicians.
- Tune threads. Increase thread counts for higher throughput, but be mindful of API rate limits.
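For the "start small" tip, a minimal configuration might look like the following. It uses only keys already shown in step 3; the thread count of 5 is an arbitrary low value chosen for illustration.

dataset_path: "../score_data/combined_dataset.json"
output_dir: "../overall_judge_trials"

tasks:
  - cpc_bond

threads:
  openai: 5

models:
  - name: "gpt-4o"
    provider: "openai"
    model_id: "gpt-4o"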
References¶
1. Goh E, et al. Large language model influence on management reasoning: a randomized trial. Nature Medicine. 2025.
2. Goh E, et al. Evaluating diagnostic reasoning in large language models. JAMA Network Open. 2024.
3. Brodeur P, et al. Evaluation of AI-assisted triage in clinical settings. 2024.
4. Brodeur P, et al. Diagnostic and management reasoning evaluation of large language models. 2025.
5. Buckley T, et al. Multimodal clinical reasoning in large language models. 2024.
6. Buckley T, et al. Open-source versus closed-source large language models for clinical reasoning. 2025.
7. Cabral S, et al. Automated evaluation of clinical consultations using R-IDEA. 2024.
8. Kanjee Z, et al. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA. 2023.