
Generating Truncation Curves

This vignette shows how to run the truncation experiment, where an LLM receives incrementally longer snippets of a clinical case and generates a differential diagnosis at each step. The resulting curves reveal how diagnostic accuracy evolves as more clinical information is revealed.

Background

Published CPC (clinicopathological conference) cases from the New England Journal of Medicine have a natural narrative structure: the case presentation gradually reveals demographics, history, exam findings, labs, and imaging. By feeding the model increasing token-length prefixes of the case text, we can trace how its differential diagnosis changes over time and measure when (and whether) it converges on the correct diagnosis.

Each truncation point produces a full differential diagnosis (top 10 ranked diagnoses) and a testing plan, which are then graded using the Bond score [1] and testing plan rubric via the PrecepTron autograder. The case dataset is drawn from NEJM CPCs previously studied by Buckley et al. [2][3]

1. Setup

git clone https://github.com/2v/Preceptron3.git
cd Preceptron3
pip install -e .
pip install pyyaml tiktoken tqdm

Create benchmarking/config.ini with your API keys (the truncation runner shares the same key format):

[api_keys]
openai = sk-...

[azure]
api_key = ...
endpoint = https://your-endpoint.openai.azure.com
api_version = 2024-12-01-preview

2. Configure the Experiment

Edit token_by_token/config.yaml:

dataset_path: "../study_data/cpcs_bond/cpcs_base_2015_4_cleaned.json"
output_dir: "../token_by_token/trials"

# Number of most recent cases (by publication_date). null = all cases.
num_cases: 10

num_threads: 1

models:
  - name: "gpt-4o"
    provider: "openai"
    model_id: "gpt-4o"

# Model used to grade the outputs (Bond score + testing plan)
grading_model:
  name: "gpt-4o"
  provider: "openai"
  model_id: "gpt-4o"

Key settings:

  • num_cases -- limit to a subset for testing, or set to null for all cases.
  • models -- the LLM(s) that will read the truncated case and generate differentials.
  • grading_model -- the LLM judge that scores each differential using the Bond score and testing plan rubrics.

3. Run the Truncation Experiment

cd token_by_token
python run_token_by_token.py --config config.yaml --keys ../benchmarking/config.ini

For each case, the runner:

  1. Tokenizes the full case presentation.
  2. Creates truncation points: every 10 tokens up to 200, then every 100 tokens to the end.
  3. At each point, decodes the token prefix back to text and prompts the model for a differential diagnosis.
  4. Saves all responses to trials/{model_name}_{case_id}.json.
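The truncation schedule in steps 2–3 can be sketched as a small helper. The function name and keyword defaults below are illustrative, not the runner's actual API:

```python
def truncation_points(total_tokens, fine_step=10, fine_limit=200, coarse_step=100):
    """Generate truncation points: every `fine_step` tokens up to
    `fine_limit`, then every `coarse_step` tokens to the end of the case."""
    # Fine-grained region: 10, 20, ..., 200 (or up to the case length)
    points = list(range(fine_step, min(fine_limit, total_tokens) + 1, fine_step))
    # Coarse region: 300, 400, ... up to (but not including) the full length
    points += list(range(fine_limit + coarse_step, total_tokens, coarse_step))
    # Always include the full case as the final point
    if not points or points[-1] != total_tokens:
        points.append(total_tokens)
    return points
```

Adjusting `fine_step` and `coarse_step` here corresponds to editing the schedule in `run_token_by_token.py` (see the Truncation Schedule section).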

4. Grade the Outputs

After generating responses, grade them with the autograder:

python grade_trials.py --config config.yaml --keys ../benchmarking/config.ini

This scores each truncation-point response using:

  • Bond score (0--5) -- did the correct diagnosis appear in the differential, and how was it ranked?
  • Testing plan (0--2) -- were appropriate diagnostic tests proposed?
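A useful summary of a graded file is the point at which the model "locks in" on the correct diagnosis. Assuming each graded entry carries `tokens` and `bond_score` keys (the format used in the plotting example in the next section), a minimal convergence check might look like:

```python
def first_convergence(trials, threshold=4):
    """Return the token count at which the Bond score first reaches
    `threshold` and never drops below it again, or None if it never does."""
    for i, t in enumerate(trials):
        if all(u["bond_score"] >= threshold for u in trials[i:]):
            return t["tokens"]
    return None
```

Requiring the score to *stay* above the threshold (rather than merely touch it once) avoids counting transient lucky guesses early in the case as convergence.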

5. Plot the Curves

Use the analysis script to visualize how scores change with increasing case information:

import json
import matplotlib.pyplot as plt

with open("trials/gpt-4o_case123_graded.json") as f:
    trials = json.load(f)

tokens = [t["tokens"] for t in trials]
bond_scores = [t["bond_score"] for t in trials]

plt.figure(figsize=(10, 5))
plt.plot(tokens, bond_scores, marker="o", linewidth=2)
plt.xlabel("Tokens revealed")
plt.ylabel("Bond score (0–5)")
plt.title("Diagnostic accuracy vs. case information")
plt.ylim(-0.2, 5.2)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("truncation_curve.png", dpi=150)
plt.show()

Placeholder

Save a single-case truncation curve to docs/assets/truncation_curve_single.png and uncomment the image above.

A dedicated plotting module is available in analysis/token_by_token_plotting.py for batch visualization across models and cases.

6. Compare Models

Run the experiment with multiple models to compare diagnostic convergence:

models:
  - name: "gpt-4o"
    provider: "openai"
    model_id: "gpt-4o"
  - name: "o1"
    provider: "openai"
    model_id: "o1"
  - name: "claude-sonnet"
    provider: "anthropic"
    model_id: "claude-sonnet-4-20250514"

Then overlay the curves to see which model reaches the correct diagnosis earliest.

Placeholder

Save a multi-model comparison plot to docs/assets/truncation_curves_comparison.png and uncomment the image above.
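An overlay plot can be sketched roughly as follows. This assumes you have already reduced each model's graded files to `(tokens, bond_score)` pairs; the function name and input format are illustrative:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so this also runs without a display
import matplotlib.pyplot as plt

def overlay_curves(curves, out_path="truncation_curves_comparison.png"):
    """Overlay one Bond-score curve per model.
    `curves` maps a model name to a list of (tokens, bond_score) pairs."""
    plt.figure(figsize=(10, 5))
    for model, pts in curves.items():
        xs, ys = zip(*pts)
        plt.plot(xs, ys, marker="o", linewidth=2, label=model)
    plt.xlabel("Tokens revealed")
    plt.ylabel("Bond score (0-5)")
    plt.ylim(-0.2, 5.2)
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig(out_path, dpi=150)
    plt.close()
    return out_path
```

For batch plotting across many cases, prefer the dedicated module in `analysis/token_by_token_plotting.py`.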

Truncation Schedule

The default truncation schedule balances resolution and cost:

| Token range | Increment        | Rationale                                                        |
|-------------|------------------|------------------------------------------------------------------|
| 0--200      | Every 10 tokens  | Fine-grained: captures initial demographics and chief complaint  |
| 200+        | Every 100 tokens | Coarser: tracks evolution through history, exam, and labs        |

You can modify the schedule in run_token_by_token.py by changing the token_increments list.

Placeholder

Save an aggregate results figure (e.g. mean Bond score across all cases by token count) to docs/assets/truncation_aggregate.png and uncomment the image above.
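For the aggregate figure, the mean Bond score at each truncation point can be computed with a sketch like this (it assumes the `tokens`/`bond_score` keys used in section 5; the helper name is hypothetical):

```python
from collections import defaultdict

def mean_bond_by_tokens(all_trials):
    """Average Bond score at each truncation point across cases.
    `all_trials` is a list of per-case trial lists, each entry a dict
    with "tokens" and "bond_score" keys (as in the graded JSON files)."""
    buckets = defaultdict(list)
    for trials in all_trials:
        for t in trials:
            buckets[t["tokens"]].append(t["bond_score"])
    # Mean score per token count, sorted by token count
    return {tok: sum(s) / len(s) for tok, s in sorted(buckets.items())}
```

Note that because cases differ in length, later token counts aggregate over fewer cases; consider reporting the number of contributing cases alongside each mean.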

References

  1. Kanjee Z, et al. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA. 2023. 

  2. Buckley T, et al. Multimodal clinical reasoning in large language models. 2024. 

  3. Buckley T, et al. Open-source versus closed-source large language models for clinical reasoning. 2025.