Using an LLM Judge for One of Our Tasks

This vignette shows how to use PrecepTron's score() function to evaluate a single clinical response with an LLM judge, using any of the supported tasks and rubrics.

Install

pip install git+https://github.com/2v/Preceptron3.git

Add Anthropic support if needed:

pip install "preceptron[anthropic] @ git+https://github.com/2v/Preceptron3.git"

Score a Differential Diagnosis (Bond Score)

The Bond score [1] rates a differential diagnosis on a 0 to 5 scale based on whether the correct diagnosis appears and how it is ranked.

from openai import OpenAI
from preceptron import score

client = OpenAI()  # uses OPENAI_API_KEY

result = score(
    task="cpc_bond",
    response="1. Pheochromocytoma\n2. Thyroid storm\n3. Carcinoid syndrome",
    final_diagnosis="Pheochromocytoma",
    model="gpt-4o",
    client=client,
)

print(result["score"])          # 5
print(result["justification"])  # why the judge gave that score
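Because the judge's reply has to be parsed, score can come back as None (see Return Value). The following is a minimal sketch of a guard for that case, assuming only the three documented fields; describe_result is a hypothetical helper for illustration, not part of PrecepTron.

```python
def describe_result(result):
    """Summarize a single-task result dict, tolerating a failed parse.

    Assumes only the documented keys: "score", "justification", "raw".
    describe_result is a hypothetical helper, not a PrecepTron function.
    """
    if result.get("score") is None:
        # Parse failure: fall back to the raw judge output for debugging.
        raw = result.get("raw") or ""
        return f"unparsed judge output: {raw[:80]}"
    # A score of 0 is a valid score, so only None is treated as a failure.
    return f"score={result['score']}: {result.get('justification', '')}"


# Hand-written dicts stand in for real score() output (no API call here).
print(describe_result({"score": 5, "justification": "Listed first.", "raw": "{}"}))
print(describe_result({"score": None, "justification": None, "raw": "I cannot answer."}))
```

Checking for None explicitly, rather than for truthiness, keeps a legitimate score of 0 from being mistaken for a failure.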

Score a Testing Plan

result = score(
    task="cpc_management",
    response="Order 24-hour urine catecholamines and metanephrines, CT abdomen with contrast",
    test_plan="24-hour urine catecholamines",
    model="gpt-4o",
    client=client,
)

Score Consultation Quality (R-IDEA [2])

result = score(
    task="r_idea",
    response="The patient presents with acute chest pain radiating to the back...",
    question_text="Evaluate the quality of this emergency consultation.",
    model="gpt-4o",
    client=client,
)

Score Diagnostic Reasoning

The diagnostic reasoning task uses a multi-axis rubric (0 to 19) [3] that evaluates the breadth and depth of clinical reasoning.

result = score(
    task="diagnostic_reasoning",
    response="Given the history of progressive dyspnea and bilateral infiltrates...",
    final_diagnosis="Pulmonary alveolar proteinosis",
    case_vignette="A 45-year-old construction worker presents with...",
    question_text="What is the most likely diagnosis?",
    model="gpt-4o",
    client=client,
)

Let the Router Pick the Rubric

If you don't know which rubric fits your response, or you want every applicable rubric applied at once, omit task=. PrecepTron's router LLM inspects your response and any context fields you provide, then picks one or more rubrics from the set of eligible tasks.

result = score(
    response=(
        "Differential: (1) pheochromocytoma, (2) thyroid storm, "
        "(3) carcinoid syndrome. Pheochromocytoma is most likely given "
        "episodic hypertension, palpitations, and elevated plasma "
        "metanephrines..."
    ),
    final_diagnosis="Pheochromocytoma",
    case_vignette=(
        "A 42-year-old woman presents with episodic headaches, "
        "palpitations, and diaphoresis..."
    ),
    question_text="What is the most likely diagnosis and why?",
    model="gpt-4o",
    client=client,
)

print(result["router"]["tasks"])
# ['cpc_bond', 'diagnostic_reasoning']

for task_name, r in result["results"].items():
    print(task_name, r["score"])
# cpc_bond 5
# diagnostic_reasoning 17

The router only considers tasks whose required context is present. Passing just final_diagnosis routes to cpc_bond; adding case_vignette and question_text makes diagnostic_reasoning eligible too. When the router picks exactly one task, the return value is flattened to the usual single-task shape (with an added router key recording the choice); otherwise, the per-task results come back keyed by task name under results.
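Since a routed call can return either shape, downstream code may want to normalize them. The sketch below is one way to do that; results_by_task is a hypothetical helper, and it assumes that in the flattened case the chosen task is the single entry in result["router"]["tasks"], which the text above implies but does not spell out.

```python
def results_by_task(result):
    """Return {task_name: single_task_result} for either routed shape.

    Hypothetical helper, not part of PrecepTron.
    """
    if "results" in result:
        # Multi-task shape: already keyed by task name.
        return dict(result["results"])
    router = result.get("router")
    if router is None:
        raise ValueError("no router information; was task= passed explicitly?")
    # ASSUMPTION: the flattened shape lists exactly one chosen task here.
    (task_name,) = router["tasks"]
    single = {key: value for key, value in result.items() if key != "router"}
    return {task_name: single}


# Hand-written examples of both shapes (no API call involved).
flat = {"score": 5, "justification": "ok", "raw": "{}",
        "router": {"tasks": ["cpc_bond"]}}
multi = {"router": {"tasks": ["cpc_bond", "diagnostic_reasoning"]},
         "results": {"cpc_bond": {"score": 5},
                     "diagnostic_reasoning": {"score": 17}}}

for shape in (flat, multi):
    for task_name, r in results_by_task(shape).items():
        print(task_name, r["score"])
```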

Override the router's model with router_model="..." if you want to use a cheaper model for routing than for scoring.

Score a Management Plan (Custom Rubric)

Management reasoning requires a case-specific rubric because each question has different expected elements:

result = score(
    task="management_reasoning",
    response="Start IV heparin, obtain CT angiography of the chest...",
    case_vignette="A 72 year-old woman admitted for gallstone pancreatitis...",
    question_text="What is your initial management plan?",
    rubric={
        "question_text": "What is your initial management plan?",
        "max_score": 5,
        "rubric_items": [
            {"points": 1, "text": "Anticoagulation"},
            {"points": 1, "text": "CT angiography"},
            {"points": 1, "text": "Echocardiography"},
            {"points": 1, "text": "IVC filter consideration"},
            {"points": 1, "text": "Oxygen supplementation"},
        ],
    },
    model="gpt-4o",
    client=client,
)
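Hand-written rubrics are easy to get slightly wrong, so a quick check before calling score() can help. This is a sketch under a stated assumption: that max_score is meant to equal the sum of the item points, as it does in the example above. PrecepTron itself may not enforce this, and check_rubric is a hypothetical helper.

```python
def check_rubric(rubric):
    """Raise ValueError if a rubric dict looks inconsistent; return the point total.

    ASSUMPTION: max_score should equal the sum of the item points, as in
    the example rubric. Hypothetical helper, not part of PrecepTron.
    """
    items = rubric.get("rubric_items", [])
    if not items:
        raise ValueError("rubric has no rubric_items")
    for index, item in enumerate(items):
        if "points" not in item or not item.get("text"):
            raise ValueError(f"rubric item {index} needs 'points' and non-empty 'text'")
    total = sum(item["points"] for item in items)
    if total != rubric.get("max_score"):
        raise ValueError(
            f"item points sum to {total}, but max_score is {rubric.get('max_score')}"
        )
    return total


example_rubric = {
    "question_text": "What is your initial management plan?",
    "max_score": 5,
    "rubric_items": [
        {"points": 1, "text": "Anticoagulation"},
        {"points": 1, "text": "CT angiography"},
        {"points": 1, "text": "Echocardiography"},
        {"points": 1, "text": "IVC filter consideration"},
        {"points": 1, "text": "Oxygen supplementation"},
    ],
}
print(check_rubric(example_rubric))
```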

Using Anthropic Models

Swap the client and the model name; everything else stays the same:

from anthropic import Anthropic
from preceptron import score

client = Anthropic()  # uses ANTHROPIC_API_KEY

result = score(
    task="cpc_bond",
    response="1. Anti-IgLON5 disease\n2. Narcolepsy",
    final_diagnosis="Anti-IgLON5-associated neurologic disorder",
    model="claude-sonnet-4-20250514",
    client=client,
)

Using OpenRouter or Other OpenAI-Compatible APIs

from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")

result = score(
    task="cpc_bond",
    response="...",
    final_diagnosis="...",
    model="google/gemini-2.5-pro",
    client=client,
)
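To keep the API key out of source code, the client settings can be read from the environment instead. This is a sketch, not a PrecepTron feature: openrouter_settings is a hypothetical helper, and the variable name OPENROUTER_API_KEY is an assumption chosen for illustration.

```python
import os


def openrouter_settings(env=None):
    """Build OpenAI-client keyword arguments for OpenRouter from the environment.

    Hypothetical helper; the OPENROUTER_API_KEY variable name is an assumption.
    """
    env = os.environ if env is None else env
    key = env.get("OPENROUTER_API_KEY")
    if not key:
        raise RuntimeError("set OPENROUTER_API_KEY before creating the client")
    return {"base_url": "https://openrouter.ai/api/v1", "api_key": key}


# Usage, with the client created exactly as in the example above:
# client = OpenAI(**openrouter_settings())
```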

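Return Value

Every single-task call returns a dict with three fields:

{
    "score": 4,                              # numeric score (or None on parse failure)
    "justification": "The correct diagnosis…", # LLM's explanation
    "raw": "{ \"score\": 4, ... }",           # full LLM response text
}

Routed calls (those that omit task=) add a router key, and when the router selects several tasks, the per-task dicts are nested under results.

References
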
  1. Kanjee Z, et al. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA. 2023. doi:10.1001/jama.2023.8288 

  2. Cabral S, et al. Clinical reasoning of a generative artificial intelligence model compared with physicians. JAMA Internal Medicine. 2024. doi:10.1001/jamainternmed.2024.0295 

  3. Goh E, et al. Large language model influence on diagnostic reasoning: a randomized clinical trial. JAMA Network Open. 2024. doi:10.1001/jamanetworkopen.2024.40969