# Using an LLM Judge for One of Our Tasks
This vignette shows how to evaluate a single clinical response with an LLM judge using PrecepTron's `score()` function, with any of the supported tasks and rubrics.
## Install
Add Anthropic support if needed:
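The exact distribution name is not stated on this page; assuming the package is published on PyPI under the same name as its import, `preceptron`, installation would look like:

```shell
# Base install (assumes the package is published on PyPI as "preceptron")
pip install preceptron

# Optional: add the Anthropic SDK for Claude-based judges
pip install anthropic
```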
## Score a Differential Diagnosis (Bond Score)
The Bond score[^1] rates a differential diagnosis on a 0–5 scale based on whether the correct diagnosis appears in the list and how highly it is ranked.
```python
from openai import OpenAI

from preceptron import score

client = OpenAI()  # uses OPENAI_API_KEY

result = score(
    task="cpc_bond",
    response="1. Pheochromocytoma\n2. Thyroid storm\n3. Carcinoid syndrome",
    final_diagnosis="Pheochromocytoma",
    model="gpt-4o",
    client=client,
)

print(result["score"])          # 5
print(result["justification"])  # why the judge gave that score
```
## Score a Testing Plan
```python
result = score(
    task="cpc_management",
    response="Order 24-hour urine catecholamines and metanephrines, CT abdomen with contrast",
    test_plan="24-hour urine catecholamines",
    model="gpt-4o",
    client=client,
)
```
## Score Consultation Quality (R-IDEA)[^2]
```python
result = score(
    task="r_idea",
    response="The patient presents with acute chest pain radiating to the back...",
    question_text="Evaluate the quality of this emergency consultation.",
    model="gpt-4o",
    client=client,
)
```
## Score Diagnostic Reasoning
The diagnostic reasoning task uses a multi-axis rubric (0–19)[^3] that evaluates the breadth and depth of clinical reasoning.
```python
result = score(
    task="diagnostic_reasoning",
    response="Given the history of progressive dyspnea and bilateral infiltrates...",
    final_diagnosis="Pulmonary alveolar proteinosis",
    case_vignette="A 45-year-old construction worker presents with...",
    question_text="What is the most likely diagnosis?",
    model="gpt-4o",
    client=client,
)
```
## Score a Management Plan (Custom Rubric)
Management reasoning requires a case-specific rubric because each question has different expected elements:
```python
result = score(
    task="management_reasoning",
    response="Start IV heparin, obtain CT angiography of the chest...",
    case_vignette="A 72-year-old woman admitted for gallstone pancreatitis...",
    question_text="What is your initial management plan?",
    rubric={
        "question_text": "What is your initial management plan?",
        "max_score": 5,
        "rubric_items": [
            {"points": 1, "text": "Anticoagulation"},
            {"points": 1, "text": "CT angiography"},
            {"points": 1, "text": "Echocardiography"},
            {"points": 1, "text": "IVC filter consideration"},
            {"points": 1, "text": "Oxygen supplementation"},
        ],
    },
    model="gpt-4o",
    client=client,
)
```
## Using Anthropic Models
Swap the client; everything else stays the same:
```python
from anthropic import Anthropic

from preceptron import score

client = Anthropic()  # uses ANTHROPIC_API_KEY

result = score(
    task="cpc_bond",
    response="1. Anti-IgLON5 disease\n2. Narcolepsy",
    final_diagnosis="Anti-IgLON5-associated neurologic disorder",
    model="claude-sonnet-4-20250514",
    client=client,
)
```
## Using OpenRouter or Other OpenAI-Compatible APIs
```python
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")

result = score(
    task="cpc_bond",
    response="...",
    final_diagnosis="...",
    model="google/gemini-2.5-pro",
    client=client,
)
```
## Return Value

Every call returns a dict with three fields:

```python
{
    "score": 4,                                 # numeric score (or None on parse failure)
    "justification": "The correct diagnosis…",  # LLM's explanation
    "raw": "{ \"score\": 4, ... }",             # full LLM response text
}
```
## References

[^1]: Kanjee Z, et al. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA. 2023.
[^2]: Cabral S, et al. Automated evaluation of clinical consultations using R-IDEA. 2024.
[^3]: Goh E, et al. Evaluating diagnostic reasoning in large language models. JAMA Network Open. 2024.