Benchmarking a New Model Against a Physician Baseline¶
This vignette demonstrates how to take a new model -- here, the open-source medical LLM Med42-v2 -- run it on the clinical cases from Brodeur et al., autoscore the outputs with an LLM judge, and compare performance against the original physician baseline. This is a proof of concept for how future studies can automatically benchmark against prior physician-graded data using PrecepTron.
Overview¶
- Generate predictions -- Run Med42 on every case from the Brodeur diagnostic/management benchmark.
- Autoscore -- Score Med42's responses using GPT-5 as the LLM judge.
- Compare -- Plot Med42 alongside the original study models and physician scores.
- Validate the judge -- Check GPT-5's agreement with the physician ground truth for robustness.
1. Setup¶
git clone https://github.com/2v/Preceptron3.git
cd Preceptron3
pip install -e ".[anthropic]"
pip install pyyaml vllm
Create benchmarking/config.ini with your API keys:
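The exact section and key names the benchmarking scripts expect aren't shown here, so the snippet below is purely illustrative -- adapt it to whatever the repository's config loader reads:

```ini
; benchmarking/config.ini -- illustrative key names
[keys]
openai_api_key = sk-...
anthropic_api_key = sk-ant-...
```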
Serve Med42 locally¶
Med42-v2 is a medical fine-tune of Llama 3 8B. Serve it with vLLM so it exposes an OpenAI-compatible endpoint:
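For example (a sketch; the exact CLI and flags depend on your vLLM version and hardware, and the model ID matches the one used in the inference code below):

```bash
# Requires a GPU with enough memory for an 8B model in fp16 (roughly 16 GB)
vllm serve m42-health/Llama3-Med42-8B --port 8000
```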
This gives you a local endpoint at http://localhost:8000/v1 that PrecepTron can call like any other OpenAI-compatible API.
2. Generate Predictions on Brodeur Cases¶
First, extract the Brodeur cases and run Med42 on each one. The cases include both diagnostic reasoning (generate a differential diagnosis) and management reasoning (answer case-specific management questions).
import json
from openai import OpenAI
# Load the full benchmark dataset
with open("score_data/combined_dataset.json") as f:
    dataset = json.load(f)
# Filter to Brodeur studies
brodeur = [e for e in dataset if e["study"].startswith("brodeur")]
print(f"Brodeur entries: {len(brodeur)}")
# Brodeur entries: 1403
# Connect to the local Med42 server
med42 = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
Run inference on the diagnostic reasoning cases:
dx_cases = [e for e in brodeur if e["benchmark"] == "diagnostic_reasoning"]
predictions = []
for entry in dx_cases:
    prompt = (
        f"You are an expert clinician. Given the following case, provide your "
        f"differential diagnosis and reasoning.\n\n"
        f"Case: {entry['case_vignette']}\n\n"
        f"Question: {entry['question_text']}"
    )
    resp = med42.chat.completions.create(
        model="m42-health/Llama3-Med42-8B",
        messages=[{"role": "user", "content": prompt}],
    )
    entry_copy = entry.copy()
    entry_copy["response"] = resp.choices[0].message.content
    entry_copy["model"] = "med42-8b"
    predictions.append(entry_copy)

with open("med42_brodeur_diagnostic.json", "w") as f:
    json.dump(predictions, f, indent=2)
Repeat for management reasoning cases (these already have per-question rubrics in the dataset):
mgmt_cases = [e for e in brodeur if e["benchmark"] == "management_reasoning"]
mgmt_predictions = []
for entry in mgmt_cases:
    prompt = (
        f"You are an expert clinician.\n\n"
        f"Case: {entry['case_vignette']}\n\n"
        f"Question: {entry['question_text']}\n\n"
        f"Provide your answer."
    )
    resp = med42.chat.completions.create(
        model="m42-health/Llama3-Med42-8B",
        messages=[{"role": "user", "content": prompt}],
    )
    entry_copy = entry.copy()
    entry_copy["response"] = resp.choices[0].message.content
    entry_copy["model"] = "med42-8b"
    mgmt_predictions.append(entry_copy)

with open("med42_brodeur_management.json", "w") as f:
    json.dump(mgmt_predictions, f, indent=2)
3. Autoscore with PrecepTron¶
Now use GPT-5 as the LLM judge to score Med42's responses against the same rubrics the physicians used.
Score diagnostic reasoning¶
from preceptron import score
from openai import OpenAI
judge = OpenAI() # uses OPENAI_API_KEY
with open("med42_brodeur_diagnostic.json") as f:
    predictions = json.load(f)

for entry in predictions:
    result = score(
        task="diagnostic_reasoning",
        response=entry["response"],
        final_diagnosis=entry["final_diagnosis"],
        case_vignette=entry["case_vignette"],
        question_text=entry["question_text"],
        model="gpt-5",
        client=judge,
    )
    entry["predicted_score"] = result["score"]
    entry["justification"] = result["justification"]

with open("med42_brodeur_diagnostic_graded.json", "w") as f:
    json.dump(predictions, f, indent=2)
Score management reasoning¶
with open("med42_brodeur_management.json") as f:
    predictions = json.load(f)

for entry in predictions:
    result = score(
        task="management_reasoning",
        response=entry["response"],
        case_vignette=entry["case_vignette"],
        question_text=entry["question_text"],
        rubric=entry["rubric"],
        model="gpt-5",
        client=judge,
    )
    entry["predicted_score"] = result["score"]
    entry["justification"] = result["justification"]

with open("med42_brodeur_management_graded.json", "w") as f:
    json.dump(predictions, f, indent=2)
4. Compare to the Original Benchmark¶
Load the original physician-graded scores for the models from the Brodeur study, add Med42, and plot.
import json
import numpy as np
import matplotlib.pyplot as plt
# Original Brodeur data (physician-graded)
with open("score_data/combined_dataset.json") as f:
    dataset = json.load(f)

brodeur_dx = [
    e for e in dataset
    if e["study"].startswith("brodeur")
    and e["benchmark"] == "diagnostic_reasoning"
]
# Med42 autograded data
with open("med42_brodeur_diagnostic_graded.json") as f:
    med42_dx = json.load(f)
# Group scores by model
from collections import defaultdict
scores_by_model = defaultdict(list)
for e in brodeur_dx:
    scores_by_model[e["model"]].append(e["grade"])
scores_by_model["Med42-8B (autograded)"] = [
e["predicted_score"] for e in med42_dx if e["predicted_score"] is not None
]
# Boxplot
fig, ax = plt.subplots(figsize=(10, 6))
labels = list(scores_by_model.keys())
data = [scores_by_model[m] for m in labels]
ax.boxplot(data, labels=labels, vert=True)
ax.set_ylabel("Score")
ax.set_title("Diagnostic Reasoning — Brodeur et al. Benchmark")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.savefig("brodeur_comparison_boxplot.png", dpi=150)
plt.show()
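Alongside the boxplot, a quick numeric summary can make the comparison concrete. A minimal sketch (the helper name is ours; it consumes the `scores_by_model` dict built above):

```python
import numpy as np

def summarize_scores(scores_by_model):
    """Return {model: (n, mean, median)} for each model's list of scores."""
    return {
        model: (len(s), float(np.mean(s)), float(np.median(s)))
        for model, s in scores_by_model.items()
    }
```

Calling `summarize_scores(scores_by_model)` after the grouping step gives per-model `(n, mean, median)` tuples to print next to the plot.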
5. Validate the Autograder¶
To trust the autoscored results, we need to verify that GPT-5 agrees with the physician ground truth on the original Brodeur data. Run the benchmark's existing graded data through the judge and measure agreement.
import json
import numpy as np
from scipy import stats
# Run GPT-5 judge on the original Brodeur data (already done via the benchmark runner)
# cd benchmarking && python run_benchmark.py --config config.yaml --keys config.ini
with open("overall_judge_trials/gpt-5_diagnostic_reasoning.json") as f:
    trials = json.load(f)
brodeur = [t for t in trials if t["study"].startswith("brodeur")]
physician = np.array([t["grade"] for t in brodeur])
judge = np.array([t["predicted_score"] for t in brodeur])
mae = np.mean(np.abs(physician - judge))
pearson_r, p_value = stats.pearsonr(physician, judge)
print(f"GPT-5 Judge vs Physician (Brodeur diagnostic)")
print(f" N = {len(brodeur)}")
print(f" MAE = {mae:.2f}")
print(f" Pearson r = {pearson_r:.3f} (p = {p_value:.1e})")
# Same for management reasoning
with open("overall_judge_trials/gpt-5_management_reasoning.json") as f:
    trials = json.load(f)
brodeur_mgmt = [t for t in trials if t["study"].startswith("brodeur")]
physician = np.array([t["grade"] for t in brodeur_mgmt])
judge = np.array([t["predicted_score"] for t in brodeur_mgmt])
mae = np.mean(np.abs(physician - judge))
pearson_r, p_value = stats.pearsonr(physician, judge)
print(f"GPT-5 Judge vs Physician (Brodeur management)")
print(f" N = {len(brodeur_mgmt)}")
print(f" MAE = {mae:.2f}")
print(f" Pearson r = {pearson_r:.3f} (p = {p_value:.1e})")
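Point estimates like MAE can be fragile at small N. One optional robustness check, not part of the original workflow, is a percentile-bootstrap confidence interval on the MAE, sketched here for the `physician` and `judge` arrays computed above:

```python
import numpy as np

def bootstrap_mae_ci(physician, judge, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean absolute error between two score arrays."""
    rng = np.random.default_rng(seed)
    physician = np.asarray(physician, dtype=float)
    judge = np.asarray(judge, dtype=float)
    n = len(physician)
    maes = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample cases with replacement
        maes[b] = np.mean(np.abs(physician[idx] - judge[idx]))
    lo, hi = np.quantile(maes, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

A narrow 95% interval around a small MAE (e.g. from `bootstrap_mae_ci(physician, judge)`) strengthens the agreement claim beyond the single-number summary.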
Strong agreement (high Pearson r, low MAE) gives confidence that the autoscored Med42 results are meaningful -- and that this workflow generalizes to any new model.
Summary¶
This vignette demonstrated the full loop:
- Serve a new model (Med42 via vLLM) and generate predictions on Brodeur cases.
- Autoscore the predictions using PrecepTron with GPT-5 as the judge.
- Compare the new model against the original physician-graded baseline in a boxplot.
- Validate the autograder's agreement with physicians for robustness.
This workflow lets any researcher benchmark a new clinical LLM against established physician baselines without needing new human evaluation -- PrecepTron's autograder and curated dataset handle it automatically.