Using an LLM Judge for One of Our Tasks

This vignette shows how to use preceptron's score() function to evaluate a single clinical response with an LLM judge, using any of the supported tasks and rubrics.

Install

pip install git+https://github.com/2v/Preceptron3.git

Add Anthropic support if needed:

pip install "preceptron[anthropic] @ git+https://github.com/2v/Preceptron3.git"

Score a Differential Diagnosis (Bond Score)

The Bond score [1] rates a differential diagnosis on a 0--5 scale based on whether the correct diagnosis appears and how it is ranked.

from openai import OpenAI
from preceptron import score

client = OpenAI()  # uses OPENAI_API_KEY

result = score(
    task="cpc_bond",
    response="1. Pheochromocytoma\n2. Thyroid storm\n3. Carcinoid syndrome",
    final_diagnosis="Pheochromocytoma",
    model="gpt-4o",
    client=client,
)

print(result["score"])          # 5
print(result["justification"])  # why the judge gave that score
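Before sending a differential to the judge, it can be useful to sanity-check locally whether the reference diagnosis appears in the ranked list at all. The helper below is not part of preceptron; it is a minimal sketch that assumes a "1. ..." numbered-list format and simple case-insensitive substring matching (the LLM judge handles fuzzier matches):

```python
def diagnosis_rank(response, final_diagnosis):
    """Return the 1-based rank of the reference diagnosis in a numbered
    differential, or None if it does not appear. Uses a case-insensitive
    substring check only -- synonyms are left to the LLM judge."""
    for rank, line in enumerate(response.splitlines(), start=1):
        # Strip a leading "1. " style number, if present
        text = line.split(".", 1)[-1].strip()
        if final_diagnosis.lower() in text.lower():
            return rank
    return None

rank = diagnosis_rank(
    "1. Pheochromocytoma\n2. Thyroid storm\n3. Carcinoid syndrome",
    "Pheochromocytoma",
)
print(rank)  # 1
```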

Score a Testing Plan

result = score(
    task="cpc_management",
    response="Order 24-hour urine catecholamines and metanephrines, CT abdomen with contrast",
    test_plan="24-hour urine catecholamines",
    model="gpt-4o",
    client=client,
)

Score Consultation Quality (R-IDEA) [2]

result = score(
    task="r_idea",
    response="The patient presents with acute chest pain radiating to the back...",
    question_text="Evaluate the quality of this emergency consultation.",
    model="gpt-4o",
    client=client,
)

Score Diagnostic Reasoning

The diagnostic reasoning task uses a multi-axis rubric (0--19) [3] that evaluates the breadth and depth of clinical reasoning.

result = score(
    task="diagnostic_reasoning",
    response="Given the history of progressive dyspnea and bilateral infiltrates...",
    final_diagnosis="Pulmonary alveolar proteinosis",
    case_vignette="A 45-year-old construction worker presents with...",
    question_text="What is the most likely diagnosis?",
    model="gpt-4o",
    client=client,
)

Score a Management Plan (Custom Rubric)

Management reasoning requires a case-specific rubric because each question has different expected elements:

result = score(
    task="management_reasoning",
    response="Start IV heparin, obtain CT angiography of the chest...",
    case_vignette="A 72 year-old woman admitted for gallstone pancreatitis...",
    question_text="What is your initial management plan?",
    rubric={
        "question_text": "What is your initial management plan?",
        "max_score": 5,
        "rubric_items": [
            {"points": 1, "text": "Anticoagulation"},
            {"points": 1, "text": "CT angiography"},
            {"points": 1, "text": "Echocardiography"},
            {"points": 1, "text": "IVC filter consideration"},
            {"points": 1, "text": "Oxygen supplementation"},
        ],
    },
    model="gpt-4o",
    client=client,
)
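When rubrics vary by case, it is often convenient to express the judge's score as a fraction of max_score so results can be aggregated across questions. This helper is not part of preceptron; it is a minimal sketch assuming the result and rubric dict shapes shown above:

```python
def normalized_score(result, rubric):
    """Convert a raw rubric score to a 0-1 fraction of max_score.
    Returns None when the judge's output could not be parsed
    (result["score"] is None)."""
    if result["score"] is None:
        return None
    return result["score"] / rubric["max_score"]

# e.g. a judge score of 4 against the 5-point rubric above:
print(normalized_score({"score": 4}, {"max_score": 5}))  # 0.8
```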

Using Anthropic Models

Swap the client -- everything else stays the same:

from anthropic import Anthropic
from preceptron import score

client = Anthropic()  # uses ANTHROPIC_API_KEY

result = score(
    task="cpc_bond",
    response="1. Anti-IgLON5 disease\n2. Narcolepsy",
    final_diagnosis="Anti-IgLON5-associated neurologic disorder",
    model="claude-sonnet-4-20250514",
    client=client,
)

Using OpenRouter or Other OpenAI-Compatible APIs

from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")

result = score(
    task="cpc_bond",
    response="...",
    final_diagnosis="...",
    model="google/gemini-2.5-pro",
    client=client,
)

Return Value

Every call returns a dict with three fields:

{
    "score": 4,                              # numeric score (or None on parse failure)
    "justification": "The correct diagnosis…", # LLM's explanation
    "raw": "{ \"score\": 4, ... }",           # full LLM response text
}

References

  1. Kanjee Z, et al. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA. 2023.

  2. Cabral S, et al. Automated evaluation of clinical consultations using R-IDEA. 2024.

  3. Goh E, et al. Evaluating diagnostic reasoning in large language models. JAMA Network Open. 2024.
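Because "score" can be None on a parse failure, batch pipelines should skip failed judgments explicitly rather than summing blindly. A minimal sketch, assuming only the returned dict shape documented above (the helper itself is not preceptron API):

```python
def mean_score(results):
    """Average the numeric scores across a batch of score() results,
    skipping judge outputs that failed to parse.
    Returns (mean, n_failed); mean is None when nothing parsed."""
    scores = [r["score"] for r in results if r["score"] is not None]
    n_failed = len(results) - len(scores)
    mean = sum(scores) / len(scores) if scores else None
    return mean, n_failed

results = [
    {"score": 5, "justification": "...", "raw": "..."},
    {"score": 3, "justification": "...", "raw": "..."},
    {"score": None, "justification": "", "raw": "not JSON"},
]
print(mean_score(results))  # (4.0, 1)
```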