Using an LLM Judge for One of Our Tasks

This vignette shows how to use preceptron's score() function to evaluate a single clinical response with an LLM judge, using any of the supported tasks and rubrics.

Install

pip install git+https://github.com/2v/Preceptron3.git

Add Anthropic support if needed:

pip install "preceptron[anthropic] @ git+https://github.com/2v/Preceptron3.git"

Score a Differential Diagnosis (Bond Score)

The Bond score [1] rates a differential diagnosis on a 0--5 scale based on whether the correct diagnosis appears and how it is ranked.

from openai import OpenAI
from preceptron import score

client = OpenAI()  # uses OPENAI_API_KEY

result = score(
    task="cpc_bond",
    response="1. Pheochromocytoma\n2. Thyroid storm\n3. Carcinoid syndrome",
    final_diagnosis="Pheochromocytoma",
    model="gpt-4o",
    client=client,
)

print(result["score"])          # 5
print(result["justification"])  # why the judge gave that score
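Before sending a differential to the judge, it can be useful to sanity-check locally whether the reference diagnosis appears in the ranked list at all. The helper below is not part of preceptron; it is a minimal sketch that assumes a "1. ..." numbered-list format and simple case-insensitive substring matching (the LLM judge handles fuzzier matches):

```python
def diagnosis_rank(response, final_diagnosis):
    """Return the 1-based rank of the reference diagnosis in a numbered
    differential, or None if it does not appear. Uses a case-insensitive
    substring check only -- synonyms are left to the LLM judge."""
    for rank, line in enumerate(response.splitlines(), start=1):
        # Strip a leading "1. " style number, if present
        text = line.split(".", 1)[-1].strip()
        if final_diagnosis.lower() in text.lower():
            return rank
    return None

rank = diagnosis_rank(
    "1. Pheochromocytoma\n2. Thyroid storm\n3. Carcinoid syndrome",
    "Pheochromocytoma",
)
print(rank)  # 1
```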

Score a Testing Plan

result = score(
    task="cpc_management",
    response="Order 24-hour urine catecholamines and metanephrines, CT abdomen with contrast",
    test_plan="24-hour urine catecholamines",
    model="gpt-4o",
    client=client,
)

Score Consultation Quality (R-IDEA) [2]

result = score(
    task="r_idea",
    response="The patient presents with acute chest pain radiating to the back...",
    question_text="Evaluate the quality of this emergency consultation.",
    model="gpt-4o",
    client=client,
)

Score Diagnostic Reasoning

The diagnostic reasoning task uses a multi-axis rubric (0--19) [3] that evaluates the breadth and depth of clinical reasoning.

result = score(
    task="diagnostic_reasoning",
    response="Given the history of progressive dyspnea and bilateral infiltrates...",
    final_diagnosis="Pulmonary alveolar proteinosis",
    case_vignette="A 45-year-old construction worker presents with...",
    question_text="What is the most likely diagnosis?",
    model="gpt-4o",
    client=client,
)

Score a Management Plan (Custom Rubric)

Management reasoning requires a case-specific rubric because each question has different expected elements:

result = score(
    task="management_reasoning",
    response="Start IV heparin, obtain CT angiography of the chest...",
    case_vignette="A 72 year-old woman admitted for gallstone pancreatitis...",
    question_text="What is your initial management plan?",
    rubric={
        "question_text": "What is your initial management plan?",
        "max_score": 5,
        "rubric_items": [
            {"points": 1, "text": "Anticoagulation"},
            {"points": 1, "text": "CT angiography"},
            {"points": 1, "text": "Echocardiography"},
            {"points": 1, "text": "IVC filter consideration"},
            {"points": 1, "text": "Oxygen supplementation"},
        ],
    },
    model="gpt-4o",
    client=client,
)
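When rubrics vary by case, it is often convenient to express the judge's score as a fraction of max_score so results can be aggregated across questions. This helper is not part of preceptron; it is a minimal sketch assuming the result and rubric dict shapes shown above:

```python
def normalized_score(result, rubric):
    """Convert a raw rubric score to a 0-1 fraction of max_score.
    Returns None when the judge's output could not be parsed
    (result["score"] is None)."""
    if result["score"] is None:
        return None
    return result["score"] / rubric["max_score"]

# e.g. a judge score of 4 against the 5-point rubric above:
print(normalized_score({"score": 4}, {"max_score": 5}))  # 0.8
```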

Using Anthropic Models

Swap the client -- everything else stays the same:

from anthropic import Anthropic
from preceptron import score

client = Anthropic()  # uses ANTHROPIC_API_KEY

result = score(
    task="cpc_bond",
    response="1. Anti-IgLON5 disease\n2. Narcolepsy",
    final_diagnosis="Anti-IgLON5-associated neurologic disorder",
    model="claude-sonnet-4-20250514",
    client=client,
)

Using OpenRouter or Other OpenAI-Compatible APIs

from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")

result = score(
    task="cpc_bond",
    response="...",
    final_diagnosis="...",
    model="google/gemini-2.5-pro",
    client=client,
)

Return Value

Every call returns a dict with three fields:

{
    "score": 4,                              # numeric score (or None on parse failure)
    "justification": "The correct diagnosis…", # LLM's explanation
    "raw": "{ \"score\": 4, ... }",           # full LLM response text
}

References

  1. Kanjee Z, et al. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA. 2023.

  2. Cabral S, et al. Automated evaluation of clinical consultations using R-IDEA. 2024.

  3. Goh E, et al. Evaluating diagnostic reasoning in large language models. JAMA Network Open. 2024.
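Because "score" can be None on a parse failure, batch pipelines should skip failed judgments explicitly rather than summing blindly. A minimal sketch, assuming only the returned dict shape documented above (the helper itself is not preceptron API):

```python
def mean_score(results):
    """Average the numeric scores across a batch of score() results,
    skipping judge outputs that failed to parse.
    Returns (mean, n_failed); mean is None when nothing parsed."""
    scores = [r["score"] for r in results if r["score"] is not None]
    n_failed = len(results) - len(scores)
    mean = sum(scores) / len(scores) if scores else None
    return mean, n_failed

results = [
    {"score": 5, "justification": "...", "raw": "..."},
    {"score": 3, "justification": "...", "raw": "..."},
    {"score": None, "justification": "", "raw": "not JSON"},
]
print(mean_score(results))  # (4.0, 1)
```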