Generating Truncation Curves¶
This vignette shows how to run the truncation experiment, where an LLM receives incrementally longer snippets of a clinical case and generates a differential diagnosis at each step. The resulting curves reveal how diagnostic accuracy evolves as more clinical information is revealed.
Background¶
Published CPC (clinicopathological conference) cases from the New England Journal of Medicine have a natural narrative structure: the case presentation gradually reveals demographics, history, exam findings, labs, and imaging. By feeding the model increasing token-length prefixes of the case text, we can trace how its differential diagnosis changes over time and measure when (and whether) it converges on the correct diagnosis.
Each truncation point produces a full differential diagnosis (top 10 ranked diagnoses) and a testing plan, which are then graded using the Bond score[1] and testing plan rubric via the PrecepTron autograder. The case dataset is drawn from NEJM CPCs previously studied by Buckley et al.[2][3]
1. Setup¶
```shell
git clone https://github.com/2v/Preceptron3.git
cd Preceptron3
pip install -e .
pip install pyyaml tiktoken tqdm
```
Create benchmarking/config.ini with your API keys (the truncation runner shares the same key format):
```ini
[api_keys]
openai = sk-...

[azure]
api_key = ...
endpoint = https://your-endpoint.openai.azure.com
api_version = 2024-12-01-preview
```
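The key file is standard INI, so a minimal sketch of how it can be parsed looks like the following (the runner's actual loader may differ; the inline sample just mirrors the template above):

```python
import configparser

# Parse the INI-format key file shown above. configparser handles the
# [section] / key = value layout directly.
sample = """
[api_keys]
openai = sk-...

[azure]
api_key = ...
endpoint = https://your-endpoint.openai.azure.com
api_version = 2024-12-01-preview
"""

config = configparser.ConfigParser()
config.read_string(sample)

openai_key = config["api_keys"]["openai"]
endpoint = config["azure"]["endpoint"]
```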
2. Configure the Experiment¶
Edit token_by_token/config.yaml:
```yaml
dataset_path: "../study_data/cpcs_bond/cpcs_base_2015_4_cleaned.json"
output_dir: "../token_by_token/trials"

# Number of most recent cases (by publication_date). null = all cases.
num_cases: 10
num_threads: 1

models:
  - name: "gpt-4o"
    provider: "openai"
    model_id: "gpt-4o"

# Model used to grade the outputs (Bond score + testing plan)
grading_model:
  name: "gpt-4o"
  provider: "openai"
  model_id: "gpt-4o"
```
Key settings:
- `num_cases` -- limit to a subset for testing, or set to `null` for all cases.
- `models` -- the LLM(s) that will read the truncated case and generate differentials.
- `grading_model` -- the LLM judge that scores each differential using the Bond score and testing plan rubrics.
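Once the YAML is parsed, `num_cases` is presumably applied by keeping the N most recent cases by `publication_date`, as the config comment suggests. A hedged sketch of that selection (field names follow the config comment; the runner's actual code may differ):

```python
# Assume config.yaml has already been parsed (e.g. with yaml.safe_load).
config = {"num_cases": 2}

# Illustrative case records; real entries come from the dataset JSON.
cases = [
    {"case_id": "a", "publication_date": "2013-05-02"},
    {"case_id": "b", "publication_date": "2015-01-15"},
    {"case_id": "c", "publication_date": "2014-07-30"},
]

n = config["num_cases"]
if n is not None:
    # ISO dates sort lexicographically, so this keeps the n newest cases.
    cases = sorted(cases, key=lambda c: c["publication_date"], reverse=True)[:n]
```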
3. Run the Truncation Experiment¶
```shell
cd token_by_token
python run_token_by_token.py --config config.yaml --keys ../benchmarking/config.ini
```
For each case, the runner:
- Tokenizes the full case presentation.
- Creates truncation points: every 10 tokens up to 200, then every 100 tokens to the end.
- At each point, decodes the token prefix back to text and prompts the model for a differential diagnosis.
- Saves all responses to `trials/{model_name}_{case_id}.json`.
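The per-case loop above can be sketched as follows. A plain whitespace split stands in for the tiktoken encoding the runner installs, and `truncation_points` and the record layout are illustrative names, not the runner's actual API:

```python
def truncation_points(n_tokens, fine_step=10, fine_limit=200, coarse_step=100):
    """Every `fine_step` tokens up to `fine_limit`, then every `coarse_step`."""
    points = list(range(fine_step, min(n_tokens, fine_limit) + 1, fine_step))
    points += list(range(fine_limit + coarse_step, n_tokens + 1, coarse_step))
    if not points or points[-1] != n_tokens:
        points.append(n_tokens)  # always include the full case
    return points

# A whitespace split stands in for the tiktoken encoding used by the runner.
case_text = "A 45-year-old woman presented with three weeks of fever and night sweats."
tokens = case_text.split()

responses = []
for n in truncation_points(len(tokens)):
    prefix = " ".join(tokens[:n])  # decode the token prefix back to text
    # ...here the runner prompts the model with `prefix` for a differential...
    responses.append({"tokens": n, "prefix": prefix})
```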
4. Grade the Outputs¶
After generating responses, grade them with the PrecepTron autograder. This scores each truncation-point response using:
- Bond score (0--5) -- did the correct diagnosis appear in the differential, and how was it ranked?
- Testing plan (0--2) -- were appropriate diagnostic tests proposed?
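A graded trial can then be summarized per truncation point. In this sketch, `tokens` and `bond_score` match the fields read by the plotting snippet below, while `testing_plan_score` is an assumed name for the 0--2 testing-plan grade:

```python
# Illustrative graded records for one case (made-up values).
graded = [
    {"tokens": 10, "bond_score": 0, "testing_plan_score": 1},
    {"tokens": 100, "bond_score": 3, "testing_plan_score": 2},
    {"tokens": 300, "bond_score": 5, "testing_plan_score": 2},
]

mean_bond = sum(t["bond_score"] for t in graded) / len(graded)

# Earliest truncation point at which the differential earned the maximum
# Bond score; None if the model never reached it.
first_max = next((t["tokens"] for t in graded if t["bond_score"] == 5), None)
```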
5. Plot the Curves¶
Use the analysis script to visualize how scores change with increasing case information:
```python
import json

import matplotlib.pyplot as plt

with open("trials/gpt-4o_case123_graded.json") as f:
    trials = json.load(f)

tokens = [t["tokens"] for t in trials]
bond_scores = [t["bond_score"] for t in trials]

plt.figure(figsize=(10, 5))
plt.plot(tokens, bond_scores, marker="o", linewidth=2)
plt.xlabel("Tokens revealed")
plt.ylabel("Bond score (0–5)")
plt.title("Diagnostic accuracy vs. case information")
plt.ylim(-0.2, 5.2)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("truncation_curve.png", dpi=150)
plt.show()
```
Placeholder
Save a single-case truncation curve to docs/assets/truncation_curve_single.png and uncomment the image above.
A dedicated plotting module is available in analysis/token_by_token_plotting.py for batch visualization across models and cases.
6. Compare Models¶
Run the experiment with multiple models to compare diagnostic convergence:
```yaml
models:
  - name: "gpt-4o"
    provider: "openai"
    model_id: "gpt-4o"
  - name: "o1"
    provider: "openai"
    model_id: "o1"
  - name: "claude-sonnet"
    provider: "anthropic"
    model_id: "claude-sonnet-4-20250514"
```
Then overlay the curves to see which model reaches the correct diagnosis earliest.
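Beyond eyeballing the overlay, "which model converges earliest" can be computed directly as the smallest truncation point whose graded Bond score reaches the maximum. A minimal sketch with made-up per-model curves (real ones come from the graded trial files):

```python
# (token_count, bond_score) pairs per model -- illustrative values only.
curves = {
    "gpt-4o": [(10, 0), (100, 3), (300, 5), (500, 5)],
    "o1":     [(10, 1), (100, 5), (300, 5), (500, 5)],
}

# Earliest truncation point at which each model hits the maximum Bond score.
convergence = {
    model: next((tok for tok, score in points if score == 5), None)
    for model, points in curves.items()
}

# The model with the smallest convergence token reached the diagnosis earliest.
earliest = min((m for m in convergence if convergence[m] is not None),
               key=lambda m: convergence[m])
```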
Placeholder
Save a multi-model comparison plot to docs/assets/truncation_curves_comparison.png and uncomment the image above.
Truncation Schedule¶
The default truncation schedule balances resolution and cost:
| Token range | Increment | Rationale |
|---|---|---|
| 0--200 | Every 10 tokens | Fine-grained: captures initial demographic and chief complaint |
| 200+ | Every 100 tokens | Coarser: tracks evolution through history, exam, and labs |
You can modify the schedule in run_token_by_token.py by changing the token_increments list.
Placeholder
Save an aggregate results figure (e.g. mean Bond score across all cases by token count) to docs/assets/truncation_aggregate.png and uncomment the image above.
References¶
1. Kanjee Z, et al. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA. 2023.
2. Buckley T, et al. Multimodal clinical reasoning in large language models. 2024.
3. Buckley T, et al. Open-source versus closed-source large language models for clinical reasoning. 2025.