
Validating Your LLM Judge on the PrecepTron Benchmark

This vignette walks through running the full PrecepTron benchmark to measure how well an LLM judge agrees with physician scores across all clinical tasks. The benchmark includes data from Goh et al. [1, 2], Brodeur et al. [3, 4], Buckley et al. [5, 6], Cabral et al. [7], and Kanjee et al. [8].

Overview

The benchmark scores every entry in the PrecepTron dataset using your chosen LLM judge, then compares its scores against the ground-truth physician grades. This tells you whether you can trust a particular model as an automated scorer for clinical reasoning.

1. Install

git clone https://github.com/2v/Preceptron3.git
cd Preceptron3
pip install -e ".[anthropic]"
pip install pyyaml

2. Configure API Keys

Create benchmarking/config.ini with credentials for the providers you plan to use:

[api_keys]
openai = sk-...
openrouter = sk-or-...
anthropic = sk-ant-...

[azure]
api_key = ...
endpoint = https://your-endpoint.openai.azure.com
api_version = 2024-12-01-preview

You only need keys for the providers you will actually use.
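For reference, credentials in this INI layout can be read with Python's standard configparser. This is an illustrative sketch using the section and key names from the example above (the actual loading code in run_benchmark.py may differ), with the file contents inlined as a string so it is self-contained:

```python
import configparser

# Parse the same INI structure as benchmarking/config.ini
# (inlined here instead of read from disk).
config = configparser.ConfigParser()
config.read_string("""
[api_keys]
openai = sk-...
anthropic = sk-ant-...

[azure]
api_key = ...
endpoint = https://your-endpoint.openai.azure.com
api_version = 2024-12-01-preview
""")

# Collect only the providers that are actually configured.
api_keys = dict(config["api_keys"])
azure = dict(config["azure"]) if config.has_section("azure") else {}

print(sorted(api_keys))  # providers with a key in [api_keys]
```

In a real run you would call config.read("benchmarking/config.ini") instead of read_string.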

3. Configure the Benchmark Run

Edit benchmarking/config.yaml (or copy from config_example.yaml) to select models, tasks, and concurrency:

dataset_path: "../score_data/combined_dataset.json"
output_dir: "../overall_judge_trials"

tasks:
  - management_reasoning
  - diagnostic_reasoning
  - r_idea
  - cpc_bond
  - cpc_management

threads:
  openai: 30
  openrouter: 20
  anthropic: 20

models:
  - name: "gpt-4o"
    provider: "openai"
    model_id: "gpt-4o"
  - name: "claude-sonnet"
    provider: "anthropic"
    model_id: "claude-sonnet-4-20250514"

Each model entry needs:

  • name -- used for output filenames
  • provider -- one of openai, openrouter, anthropic, azure
  • model_id -- the model identifier to pass to the API

Leave tasks empty or omit it to run all tasks.
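Before launching a long run, it can be worth loading the config with PyYAML (installed in step 1) and checking that every model entry carries the three required fields. This check is an illustration, not part of the repository; the config is inlined so the sketch is self-contained:

```python
import yaml

REQUIRED_MODEL_FIELDS = {"name", "provider", "model_id"}
VALID_PROVIDERS = {"openai", "openrouter", "anthropic", "azure"}

# Inlined config mirroring benchmarking/config.yaml.
config = yaml.safe_load("""
dataset_path: "../score_data/combined_dataset.json"
output_dir: "../overall_judge_trials"
models:
  - name: "gpt-4o"
    provider: "openai"
    model_id: "gpt-4o"
""")

for model in config["models"]:
    missing = REQUIRED_MODEL_FIELDS - model.keys()
    assert not missing, f"{model.get('name')}: missing {missing}"
    assert model["provider"] in VALID_PROVIDERS, model["provider"]

# An absent or empty `tasks` key means all tasks run.
tasks = config.get("tasks") or []
print(f"{len(config['models'])} model(s), {len(tasks) or 'all'} task(s)")
```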

4. Run the Benchmark

cd benchmarking
python run_benchmark.py --config config.yaml --keys config.ini

Results are saved to output_dir as {model_name}_{task_name}.json. The runner supports automatic resume -- if a run is interrupted, re-running the same command skips already-completed entries.
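The resume behavior amounts to: before scoring, load any existing output file and skip entries it already contains. A hedged sketch of that pattern (the entry "id" field and helper names here are assumptions; the runner's actual bookkeeping may differ):

```python
import json
import os

def load_completed(output_path):
    """Return the IDs already scored in a previous (possibly interrupted) run."""
    if not os.path.exists(output_path):
        return set()
    with open(output_path) as f:
        return {entry["id"] for entry in json.load(f)}

def entries_to_run(dataset, output_path):
    """Filter the dataset down to entries with no saved prediction yet."""
    done = load_completed(output_path)
    return [e for e in dataset if e["id"] not in done]

# Example with an in-memory dataset and no prior output file:
dataset = [{"id": 1}, {"id": 2}, {"id": 3}]
pending = entries_to_run(dataset, "nonexistent_output.json")
print(len(pending))  # all 3 entries, since nothing has been scored yet
```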

5. Analyze Results

Each output file contains the original dataset entries augmented with the LLM judge's prediction. You can compute agreement metrics (e.g. Cohen's kappa, Pearson correlation, MAE) by comparing each entry's grade field (the physician score) against its predicted_score field (the judge's score).

import json
import numpy as np

with open("../overall_judge_trials/gpt-4o_cpc_bond.json") as f:
    trials = json.load(f)

physician = [t["grade"] for t in trials]
predicted = [t["predicted_score"] for t in trials]

correlation = np.corrcoef(physician, predicted)[0, 1]
mae = np.mean(np.abs(np.array(physician) - np.array(predicted)))

print(f"Pearson r: {correlation:.3f}")
print(f"MAE: {mae:.2f}")
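Cohen's kappa, mentioned above but not computed in the snippet, needs nothing beyond NumPy. A minimal sketch (unweighted by default, with an optional quadratic weighting that is common for ordinal grades; assumes integer-valued scores):

```python
import numpy as np

def cohens_kappa(a, b, weights=None):
    """Cohen's kappa for two raters; weights='quadratic' for ordinal scores."""
    a, b = np.asarray(a), np.asarray(b)
    cats = np.union1d(a, b)
    n = len(cats)
    idx = {c: i for i, c in enumerate(cats)}
    # Observed agreement: normalized confusion matrix of the two raters.
    obs = np.zeros((n, n))
    for x, y in zip(a, b):
        obs[idx[x], idx[y]] += 1
    obs /= obs.sum()
    # Expected agreement under independence, from the marginals.
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))
    if weights == "quadratic":
        # Penalize disagreements by squared category distance.
        w = np.subtract.outer(np.arange(n), np.arange(n)) ** 2
    else:
        w = 1 - np.eye(n)  # unweighted: every disagreement costs 1
    return 1 - (w * obs).sum() / (w * exp).sum()

# Perfect agreement yields kappa = 1.
print(cohens_kappa([1, 2, 3, 1], [1, 2, 3, 1]))  # 1.0
```

With the variables from the snippet above, kappa = cohens_kappa(physician, predicted, weights="quadratic"); values near 1 indicate strong judge-physician agreement, values near 0 indicate chance-level agreement.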

Tips

  • Start small. Test on a single task (e.g. cpc_bond) with one model before running the full matrix.
  • Compare models. Run the same config with different model_id values to see which LLM judge aligns best with physicians.
  • Tune threads. Increase thread counts for higher throughput, but be mindful of API rate limits.

References


  1. Goh E, et al. Large language model influence on management reasoning: a randomized trial. Nature Medicine. 2025. 

  2. Goh E, et al. Evaluating diagnostic reasoning in large language models. JAMA Network Open. 2024. 

  3. Brodeur P, et al. Evaluation of AI-assisted triage in clinical settings. 2024. 

  4. Brodeur P, et al. Diagnostic and management reasoning evaluation of large language models. 2025. 

  5. Buckley T, et al. Multimodal clinical reasoning in large language models. 2024. 

  6. Buckley T, et al. Open-source versus closed-source large language models for clinical reasoning. 2025. 

  7. Cabral S, et al. Automated evaluation of clinical consultations using R-IDEA. 2024. 

  8. Kanjee Z, et al. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA. 2023.