Using an LLM Judge for One of Our Tasks
This vignette shows how to use PrecepTron's score() function to evaluate a single clinical response with an LLM judge, using any of the supported tasks and rubrics.
Install
Add Anthropic support if needed:
Score a Differential Diagnosis (Bond Score)
The Bond score [1] rates a differential diagnosis on a 0–5 scale based on whether the correct diagnosis appears and how it is ranked.
from openai import OpenAI
from preceptron import score

client = OpenAI()  # uses OPENAI_API_KEY
result = score(
    task="cpc_bond",
    response="1. Pheochromocytoma\n2. Thyroid storm\n3. Carcinoid syndrome",
    final_diagnosis="Pheochromocytoma",
    model="gpt-4o",
    client=client,
)
print(result["score"]) # 5
print(result["justification"]) # why the judge gave that score
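The Return Value section notes that score can be None when the judge's reply fails to parse, so it may help to guard against that before aggregating several results. The following is a minimal sketch, not part of PrecepTron: `mean_score` is a hypothetical helper, and the dicts are hand-written stand-ins for what score() would return.

```python
from typing import Optional


def mean_score(results: list[dict]) -> Optional[float]:
    """Average the numeric scores, skipping results whose score is None."""
    valid = [r["score"] for r in results if r.get("score") is not None]
    return sum(valid) / len(valid) if valid else None


# Hand-written stand-ins for score() return values (not real judge output).
example_results = [
    {"score": 5, "justification": "...", "raw": "..."},
    {"score": None, "justification": None, "raw": "unparseable text"},
    {"score": 3, "justification": "...", "raw": "..."},
]

print(mean_score(example_results))  # 4.0
```

Skipping unparsed results is only one option; counting them separately or re-running the judge may suit some evaluations better.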
Score a Testing Plan
result = score(
    task="cpc_management",
    response="Order 24-hour urine catecholamines and metanephrines, CT abdomen with contrast",
    test_plan="24-hour urine catecholamines",
    model="gpt-4o",
    client=client,
)
Score Consultation Quality (R-IDEA [2])
result = score(
    task="r_idea",
    response="The patient presents with acute chest pain radiating to the back...",
    question_text="Evaluate the quality of this emergency consultation.",
    model="gpt-4o",
    client=client,
)
Score Diagnostic Reasoning
The diagnostic reasoning task uses a multi-axis rubric (0–19) [3] that evaluates the breadth and depth of clinical reasoning.
result = score(
    task="diagnostic_reasoning",
    response="Given the history of progressive dyspnea and bilateral infiltrates...",
    final_diagnosis="Pulmonary alveolar proteinosis",
    case_vignette="A 45-year-old construction worker presents with...",
    question_text="What is the most likely diagnosis?",
    model="gpt-4o",
    client=client,
)
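Because the Bond score runs from 0 to 5 and the diagnostic reasoning rubric from 0 to 19, raw scores from different tasks are not directly comparable. One possible way to put them on a common footing is to divide by the rubric maximum. This is an illustrative sketch only: `MAX_SCORE` and `as_fraction` are hypothetical names rather than PrecepTron API, and only the two ranges stated in this vignette are included.

```python
# Maximum scores taken from the ranges quoted in this vignette.
MAX_SCORE = {
    "cpc_bond": 5,               # Bond score, 0-5
    "diagnostic_reasoning": 19,  # multi-axis rubric, 0-19
}


def as_fraction(task: str, score):
    """Return score / max for a known task, or None if the score is missing."""
    if score is None:
        return None
    if task not in MAX_SCORE:
        raise KeyError(f"no maximum recorded for task {task!r}")
    return score / MAX_SCORE[task]


print(as_fraction("cpc_bond", 5))                         # 1.0
print(round(as_fraction("diagnostic_reasoning", 17), 3))  # 0.895
```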
Let the Router Pick the Rubric
If you don't know which rubric fits your response, or you want every
applicable rubric applied at once, omit task=. PrecepTron's router LLM
inspects your response and any context fields you provide, then picks one or
more rubrics from the set of tasks that are eligible to run.
result = score(
    response=(
        "Differential: (1) pheochromocytoma, (2) thyroid storm, "
        "(3) carcinoid syndrome. Pheochromocytoma is most likely given "
        "episodic hypertension, palpitations, and elevated plasma "
        "metanephrines..."
    ),
    final_diagnosis="Pheochromocytoma",
    case_vignette=(
        "A 42-year-old woman presents with episodic headaches, "
        "palpitations, and diaphoresis..."
    ),
    question_text="What is the most likely diagnosis and why?",
    model="gpt-4o",
    client=client,
)
print(result["router"]["tasks"])
# ['cpc_bond', 'diagnostic_reasoning']
for task_name, r in result["results"].items():
    print(task_name, r["score"])
# cpc_bond 5
# diagnostic_reasoning 17
The router only considers tasks whose required context is present. Passing
just final_diagnosis routes to cpc_bond; add a case_vignette and
question_text and diagnostic_reasoning becomes eligible too. When the
router picks exactly one task the return value is flattened to the usual
single-task shape (with an added router key recording the choice);
otherwise results come back keyed under results.
Override the router's model with router_model="..." if you want to use a
cheaper model for routing than for scoring.
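Since the return shape depends on how many tasks the router picks, downstream code may want a single view of the scores. The helper below is a sketch under assumptions, not part of PrecepTron: `scores_by_task` is a hypothetical name, and it assumes the flattened single-task shape records the chosen task as the only entry of result["router"]["tasks"].

```python
def scores_by_task(result: dict) -> dict:
    """Map task name -> score for either router return shape."""
    if "results" in result:
        # Multi-task shape: per-task dicts keyed under "results".
        return {name: r["score"] for name, r in result["results"].items()}
    # Flattened shape (assumed): exactly one task recorded under router.tasks.
    (task_name,) = result["router"]["tasks"]
    return {task_name: result["score"]}


# Hand-written stand-ins mirroring the two shapes described above.
multi = {
    "router": {"tasks": ["cpc_bond", "diagnostic_reasoning"]},
    "results": {
        "cpc_bond": {"score": 5},
        "diagnostic_reasoning": {"score": 17},
    },
}
single = {"router": {"tasks": ["cpc_bond"]}, "score": 5}

print(scores_by_task(multi))   # {'cpc_bond': 5, 'diagnostic_reasoning': 17}
print(scores_by_task(single))  # {'cpc_bond': 5}
```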
Score a Management Plan (Custom Rubric)
Management reasoning requires a case-specific rubric because each question has different expected elements:
result = score(
    task="management_reasoning",
    response="Start IV heparin, obtain CT angiography of the chest...",
    case_vignette="A 72-year-old woman admitted for gallstone pancreatitis...",
    question_text="What is your initial management plan?",
    rubric={
        "question_text": "What is your initial management plan?",
        "max_score": 5,
        "rubric_items": [
            {"points": 1, "text": "Anticoagulation"},
            {"points": 1, "text": "CT angiography"},
            {"points": 1, "text": "Echocardiography"},
            {"points": 1, "text": "IVC filter consideration"},
            {"points": 1, "text": "Oxygen supplementation"},
        ],
    },
    model="gpt-4o",
    client=client,
)
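Because a case-specific rubric is written by hand, a quick consistency check before calling score() may catch mistakes. This is a sketch only: `check_rubric` is a hypothetical helper, and the rule that item points should sum to max_score is an assumption drawn from the example above rather than a documented PrecepTron requirement.

```python
def check_rubric(rubric: dict) -> None:
    """Raise ValueError if the rubric looks internally inconsistent."""
    items = rubric.get("rubric_items", [])
    if not items:
        raise ValueError("rubric has no rubric_items")
    for item in items:
        if not item.get("text"):
            raise ValueError("every rubric item needs non-empty text")
    total = sum(item["points"] for item in items)
    # Assumed rule: item points should add up to max_score.
    if total != rubric["max_score"]:
        raise ValueError(
            f"points sum to {total}, but max_score is {rubric['max_score']}"
        )


rubric = {
    "question_text": "What is your initial management plan?",
    "max_score": 5,
    "rubric_items": [
        {"points": 1, "text": "Anticoagulation"},
        {"points": 1, "text": "CT angiography"},
        {"points": 1, "text": "Echocardiography"},
        {"points": 1, "text": "IVC filter consideration"},
        {"points": 1, "text": "Oxygen supplementation"},
    ],
}
check_rubric(rubric)  # passes silently for the rubric used above
```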
Using Anthropic Models
Swap the client and the model name; everything else stays the same:
from anthropic import Anthropic
from preceptron import score

client = Anthropic()  # uses ANTHROPIC_API_KEY
result = score(
    task="cpc_bond",
    response="1. Anti-IgLON5 disease\n2. Narcolepsy",
    final_diagnosis="Anti-IgLON5-associated neurologic disorder",
    model="claude-sonnet-4-20250514",
    client=client,
)
Using OpenRouter or Other OpenAI-Compatible APIs
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")
result = score(
    task="cpc_bond",
    response="...",
    final_diagnosis="...",
    model="google/gemini-2.5-pro",
    client=client,
)
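Rather than hard-coding the key, the client settings could be read from the environment. This sketch only builds the keyword arguments for the OpenAI constructor; `openrouter_client_kwargs` is a hypothetical helper, and the variable name OPENROUTER_API_KEY is an assumption, not something this vignette states that PrecepTron or the OpenAI SDK reads on its own.

```python
import os

OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"


def openrouter_client_kwargs(env_var: str = "OPENROUTER_API_KEY") -> dict:
    """Build base_url/api_key kwargs for OpenAI(...) from the environment."""
    api_key = os.environ.get(env_var)  # env_var name is an assumption
    if not api_key:
        raise RuntimeError(f"set {env_var} before creating the client")
    return {"base_url": OPENROUTER_BASE_URL, "api_key": api_key}
```

The result could then be passed as OpenAI(**openrouter_client_kwargs()), after which score() is called as in the example above.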
Return Value
Every single-task call returns a dict with three fields (router calls that select several tasks nest one such dict per task under results):
{
    "score": 4,                                 # numeric score (or None on parse failure)
    "justification": "The correct diagnosis…",  # LLM's explanation
    "raw": "{ \"score\": 4, ... }",             # full LLM response text
}
References
1. Kanjee Z, et al. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA. 2023. doi:10.1001/jama.2023.8288
2. Cabral S, et al. Clinical reasoning of a generative artificial intelligence model compared with physicians. JAMA Internal Medicine. 2024. doi:10.1001/jamainternmed.2024.0295
3. Goh E, et al. Large language model influence on diagnostic reasoning: a randomized clinical trial. JAMA Network Open. 2024. doi:10.1001/jamanetworkopen.2024.40969