Authors: Hejie Cui*, Alyssa Unell*, Bowen Chen, Jason Alan Fries, Emily Alsentzer, Sanmi Koyejo, Nigam Shah
TIMER⌛️ is a temporal instruction modeling and evaluation framework for longitudinal clinical records! 🏥📈 TIMER tackles challenges in processing longitudinal medical records—including temporal reasoning, multi-visit synthesis, and patient trajectory analysis. It introduces:
🔹 Time-aware benchmarks to evaluate temporal reasoning abilities
🔹 Temporal instruction tuning for enhanced longitudinal understanding
🔹 Distribution-aware training strategies for balanced temporal coverage
Overview of the TIMER framework. Left: TIMER-Bench creates evaluation sets with explicit temporal evidence, covering questions across different time periods in patient histories to assess longitudinal EHR reasoning. Right: TIMER-Instruct enhances model performance through instruction tuning on LLM-generated instruction-response pairs that are temporally distributed across EHR timelines.
Create a conda environment from the environment file:
conda env create -f environment.yml
The scripts for instruction data generation are located in the timer/instruct_gen/ folder.
The instruction generation pipeline is composed of several steps.
sample_patients.py connects to the BigQuery table and IID-samples a set of patient IDs. Example command:
python sample_patients.py --sampling_method random --sample_n 5
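For orientation, the sampling step roughly amounts to the following sketch; the project, dataset, and table names are placeholders, and sample_patients.py reads the real ones from its own configuration:

```python
# Illustrative IID sampling of patient IDs from BigQuery.
# Table and column names below are placeholders.
from google.cloud import bigquery

def sample_patient_ids(n: int = 5) -> list[str]:
    client = bigquery.Client()
    query = """
        SELECT person_id
        FROM `my-project.ehr_dataset.person`
        ORDER BY RAND()
        LIMIT @n
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("n", "INT64", n)]
    )
    rows = client.query(query, job_config=job_config).result()
    return [row.person_id for row in rows]
```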
generator.py uses asyncio to call the LLM API and generate the instruction set based on the materialized EHR records. Generation uses a 16k-token context window.
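The pattern is roughly the following sketch, assuming an OpenAI-compatible endpoint; the model name, system prompt, and gathering logic are illustrative rather than the exact ones in generator.py:

```python
# Illustrative async generation loop; the real prompts, model, and
# batching logic in generator.py differ.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def generate_instructions(ehr_record: str, model: str = "gpt-4o") -> str:
    # Each request fits one materialized EHR record into a ~16k-token context.
    response = await client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Generate instruction-response pairs grounded in the EHR below."},
            {"role": "user", "content": ehr_record},
        ],
    )
    return response.choices[0].message.content

async def generate_all(ehr_records: list[str]) -> list[str]:
    # Issue the API calls concurrently and collect the results.
    return await asyncio.gather(*(generate_instructions(r) for r in ehr_records))

# instructions = asyncio.run(generate_all(records))
```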
alpaca_format.py: parses the generated JSON files and converts the synthetic instruction-response pairs into the Alpaca instruction format (see alpaca_data.json for more details):
- Specify where the instruction JSON files are located (currently we support JSON-formatted instruction-response pairs with the EHR ID as the file name)
- Specify where the reformatted data should be saved
- Specify where the EHR files can be found (currently we support one JSONL file whose entries contain a uid field indicating the EHR ID, used to match the instruction file name)
- Specify which model's tokenizer you will be using in training; this ensures that the retrieved EHR is truncated to fit within the context length of the model that will be instruction-tuned
- Specify the context length for truncating the EHR, only if naive chunking is used
# parse LLM-generated responses and convert the instruction set to Alpaca format for SFT
python alpaca_format.py --instruction_folder {TODO} --output_folder {TODO} --ehr_data_path {TODO} --context_length {TODO}
Alternatively, check out alpaca_format.sh for all default arguments and paths.
The output JSON file ./data/ehr_data.json is a list of dictionaries, where each dictionary contains the following fields:
- instruction: str, the generated task the model should perform. Each instruction is unique.
- input: str, context or input for the task. Here we put the EHR record as the grounding for the instruction-response pair.
- output: str, the answer to the instruction.
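For illustration, a single entry in ./data/ehr_data.json has the following shape (the values below are made up to show the schema):

```json
[
  {
    "instruction": "Summarize the patient's blood pressure trend across the documented visits.",
    "input": "[2021-03-02] Office visit: BP 142/90 ... [2021-06-15] Office visit: BP 138/88 ... [2021-09-20] Office visit: BP 130/84 ...",
    "output": "The patient's blood pressure decreased steadily across the three visits, from 142/90 to 130/84."
  }
]
```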
We use the following prompt template for model fine-tuning:
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Input:
{input}
### Response:
The scripts for instruction-response pair selection are located in the timer/instruct_select/ folder.
The script temporal_selection.py supports sampling instruction sets that follow different temporal distributions: recency-focused, edge-focused, and uniformly distributed. The script balanced_bench_sampling.py performs subset sampling that follows a uniform distribution.
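A minimal sketch of what these strategies amount to, assuming each instruction-response pair carries a normalized evidence position in [0, 1] along the patient timeline (the field name and weighting functions are illustrative, not the exact implementation):

```python
# Illustrative temporal subset sampling; temporal_selection.py's exact
# weighting scheme and data format may differ.
import numpy as np

def temporal_weights(positions: np.ndarray, strategy: str) -> np.ndarray:
    """positions: normalized evidence times in [0, 1], where 1.0 is the most recent visit."""
    if strategy == "recency":    # favor instructions grounded near the end of the record
        w = positions ** 4
    elif strategy == "edge":     # favor the earliest and latest visits
        w = np.maximum(positions, 1.0 - positions) ** 4
    elif strategy == "uniform":  # equal weight across the whole timeline
        w = np.ones_like(positions)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return w / w.sum()

def sample_subset(pairs: list[dict], strategy: str, k: int, seed: int = 0) -> list[dict]:
    rng = np.random.default_rng(seed)
    positions = np.array([p["temporal_position"] for p in pairs])
    idx = rng.choice(len(pairs), size=k, replace=False, p=temporal_weights(positions, strategy))
    return [pairs[i] for i in idx]
```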
The code for model instruction-tuning is located in the timer/instruct_tune/ folder.
instruct_tune/tune_llama_recipes.py: instruction-tunes Llama-3 with the generated instruction set and includes short-context validation on the MMLU college medicine and clinical knowledge benchmarks.
The fine-tuned model will be saved in the --output_dir folder (the ./models folder under your project path) for inference and evaluation. The hyperparameters were found via a wandb sweep to be optimal for our current setup.
To change the base model on which we perform PEFT fine-tuning, simply update the --model_name parameter. To use wandb, add --use_wandb to the parameters; you'll need to log in by running wandb login and entering your API key when prompted. The entity and project are set in the script.
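Putting these flags together, a typical launch might look like the following; the base model name here is an assumption, so substitute the model you intend to fine-tune:

```bash
# Illustrative launch; adjust the model name and paths to your setup.
python instruct_tune/tune_llama_recipes.py \
    --model_name meta-llama/Meta-Llama-3-8B-Instruct \
    --output_dir ./models \
    --use_wandb
```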
The code for evaluation is located in the timer/evaluate/ folder.
To evaluate both fine-tuned and baseline models on MedAlign, follow the steps below. First, preprocess the MedAlign instruction, response, and EHR triplets so that they match the context-length and generation-length parameters that will be used at inference.
At inference time, the command-line argument --enable_lora differentiates between baselines and fine-tuned models. To run a fine-tuned model, turn on this flag and set --lora_path to the model checkpoint generated in the SFT stage. The --path_to_prompts argument points to the preprocessed output from the preprocessing step above. The output of this command can be used for NLP evaluations, LLM-as-judge evaluations, and DocLens evaluations.
For NLP evaluation of MedAlign output, you will need access to the reference answers, which are compared against the responses generated in the inference step above.
This guide explains how to generate correctness and completeness metrics using the LLM-as-Judge evaluator.
Place your model responses and reference materials under the result/ directory (see the layout sketch below). Required files/args:
- clinician-instruction-responses.csv: reference responses for each instruction in MedAlign
- full_patient_ehrs/*: relevant EHRs for the instructions
- model-generated responses in CSV format
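A sketch of the expected result/ layout (the model response file name is only an example):

```text
result/
├── clinician-instruction-responses.csv   # reference responses for MedAlign instructions
├── full_patient_ehrs/                    # relevant EHRs for the instructions
│   └── ...
└── my_model_responses.csv                # model-generated responses
```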
Then run the judge:
./run_judge.sh
We offer an alternative method for head-to-head evaluation, where a judge compares two model responses against the reference answer to determine which is better.
To compare model responses from model_a and model_b, use the following command:
python head_to_head_eval.py --model_a_responses ../result/baseline/test_A_response.csv --model_b_responses ../result/baseline/test_B_response.csv --reference_answers ../result/clinician-instruction-responses.csv --output_file ../result/head_to_head/{model_a}_{model_b}.json
In this command:
- --model_a_responses: path to the CSV file containing responses from model A.
- --model_b_responses: path to the CSV file containing responses from model B.
- --reference_answers: path to the CSV file containing the reference answers.
- --output_file: path where the results of the head-to-head evaluation will be saved, with {model_a} and {model_b} replaced by the actual model names.
We evaluate TIMER on both human-annotated and model-generated benchmarks:
📈 +7.3% improvement on physician-generated MedAlign benchmark
📈 +9.2% improvement on temporal reasoning on TIMER-Bench
More results and additional metrics are available in the paper.
Text generation conditioned on a long-context input exhibits a "lost in the middle" effect, indicating that we need to consider how we sample from the instruction distribution to achieve full longitudinal coverage.
To evaluate the impact of different temporal distribution strategies for instruction tuning on EHR reasoning tasks, we conducted evaluations on benchmarks with three different temporal distributions: (1) the human-annotated MedAlign benchmark, which shows a recency-focused distribution; (2) an edge-focused TIMER-Bench, where evaluation instruction-response pairs are randomly sampled from the natural model-generated distribution; and (3) a uniformly distributed TIMER-Bench, where evaluation instruction-response pairs are sampled with equal frequency across all patient visits.
We thank llama-cookbook for open-sourcing the model training frameworks that we used in this work.
@article{cui2025timer,
title={TIMER: Temporal Instruction Modeling and Evaluation for Longitudinal Clinical Records},
author={Cui, Hejie and Unell, Alyssa and Chen, Bowen and Fries, Jason Alan and Alsentzer, Emily and Koyejo, Sanmi and Shah, Nigam},
journal={arXiv preprint arXiv:2503.04176},
year={2025}
}