PrecepTron

Can you trust an LLM to grade clinical reasoning like a physician?

PrecepTron is an open benchmark for answering that question. We release thousands of real physician-scored responses from published clinical studies — spanning differential diagnosis, management planning, and consultation quality — so you can measure whether your LLM judge actually agrees with expert clinicians.

Links: Hugging Face Dataset · GitHub · arXiv


  • 5,250 scored responses
  • 9,821 physician scores
  • 121 physicians in baseline
  • 11 physician evaluators
  • 6 clinical tasks
  • 7 source studies


What is PrecepTron?

LLMs are increasingly used as automated graders for medical AI research — but how do you know the grader itself is reliable? PrecepTron gives you a way to find out.

We collected physician scores from studies at Beth Israel Deaconess Medical Center (BIDMC) and other institutions, covering tasks like differential diagnosis, clinical management, and consultation quality. With this ground truth, you can:

  • Validate any LLM judge against real physician scores across multiple clinical tasks.
  • Benchmark new models against physician baselines from prior studies, without needing new human evaluation.
  • Score clinical responses with a single Python function call using standardized, published rubrics.
  • Use optimized judge prompts that we release alongside the benchmark, tuned via GEPA to maximize agreement with physician scores.

For detailed breakdowns of the dataset by task, study, and model, see the Dataset page.


Get Started

Install PrecepTron and score a clinical response in under a minute:

pip install git+https://github.com/2v/Preceptron3.git

from openai import OpenAI
from preceptron import score

client = OpenAI()

result = score(
    task="cpc_bond",
    response="1. Pheochromocytoma\n2. Thyroid storm\n3. Carcinoid syndrome",
    final_diagnosis="Pheochromocytoma",
    model="gpt-4o",
    client=client,
)

print(result["score"])          # 5
print(result["justification"])  # LLM's reasoning

Works with OpenAI, Anthropic, Azure, OpenRouter, or any OpenAI-compatible API.
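Switching providers usually just means constructing the client with a different base URL before passing it to score(). A hedged sketch of that pattern (the endpoint map, helper function, and environment-variable names below are illustrative assumptions, not part of PrecepTron; consult each provider's documentation for the actual values):

```python
import os

# Hypothetical OpenAI-compatible endpoints (None = use the SDK default).
ENDPOINTS = {
    "openai": None,
    "openrouter": "https://openrouter.ai/api/v1",
}


def client_kwargs(provider: str) -> dict:
    """Build constructor kwargs for an OpenAI-compatible client."""
    kwargs = {"api_key": os.environ.get(f"{provider.upper()}_API_KEY", "")}
    base_url = ENDPOINTS.get(provider)
    if base_url:
        kwargs["base_url"] = base_url
    return kwargs


# e.g. client = OpenAI(**client_kwargs("openrouter"))
print(client_kwargs("openrouter")["base_url"])
```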


Vignettes

Step-by-step guides for common workflows.


Source Studies

PrecepTron aggregates physician-scored data from the following published studies:

  • Goh et al. 2025, "GPT-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial" (Nature Medicine). Tasks: management reasoning.
  • Goh et al. 2024, "Large language model influence on diagnostic reasoning: a randomized clinical trial" (JAMA Network Open). Tasks: diagnostic reasoning.
  • Brodeur et al. 2025, "Superhuman performance of a large language model on the reasoning tasks of a physician" (arXiv). Tasks: diagnostic reasoning, management reasoning, CPC Bond, BI triage.
  • Buckley et al. 2024, "Multimodal foundation models exploit text to make medical image predictions" (arXiv). Tasks: CPC Bond, CPC management.
  • Buckley et al. 2025, "Comparison of frontier open-source and proprietary large language models for complex diagnoses" (JAMA Health Forum). Tasks: CPC Bond.
  • Cabral et al. 2024, "Clinical reasoning of a generative artificial intelligence model compared with physicians" (JAMA Internal Medicine). Tasks: R-IDEA.
  • Kanjee et al. 2023, "Accuracy of a generative artificial intelligence model in a complex diagnostic challenge" (JAMA). Tasks: CPC Bond.

Citation

@article{preceptron2025,
  title={PrecepTron: A Benchmark for Validating LLM-as-a-Judge
         on Clinical Reasoning Tasks},
  author={Buckley, Thomas and others},
  journal={arXiv preprint},
  year={2025}
}