Read original: arXiv:2403.19790 - Published 4/1/2024 by Niall Taylor, Andrey Kormilitzin, Isabelle Lorge, Alejo Nevado-Holgado, Dan W Joyce
The paper discusses the potential of using large language models (LLMs) to assist clinicians in triaging patients to appropriate specialist mental healthcare teams within the UK's National Health Service (NHS). Mental healthcare in the NHS is stratified into primary, secondary, and tertiary services, with most patients requiring secondary care being referred to community mental health teams (CMHTs).

The referral process involves a written narrative from a general practitioner (GP) describing the patient's difficulties, symptoms, and relevant risks. CMHTs offer a single point-of-access and triage function, with some additional sub-specialist teams available for specific conditions like eating disorders or first-episode psychosis.

Triage decisions are made using the referral documentation and any available historical electronic health record (EHR) data. The process is time-consuming, subjective, and can lead to "referral bouncing" between teams, causing frustration for patients and referrers.

The authors propose using LLMs to assist clinicians in extracting relevant clinical data and allocating patients to appropriate teams, rather than automating the triage process entirely. This assisted triage approach could improve efficiency, transparency, and robustness of the triage process.

Challenges with Unstructured EHR Data

This section covers the challenges of using LLMs to assist clinical triage from EHR data. The main points are:

  1. Clinical natural language processing (NLP) has traditionally relied on information extraction, heuristics, and named entity recognition to locate relevant information from noisy, idiosyncratic clinical text. In contrast, the authors propose using LLMs to generate embeddings of patients' referral information and histories to determine the best triage team alignment.

  2. Managing variable token sequence lengths is a challenge when using transformer-based LLMs for long sequences due to the finite context window and quadratic time- and memory-complexity of the self-attention mechanism.

  3. The idiosyncratic nature of clinical language, particularly in mental health, poses difficulties for general-purpose LLMs trained on biomedical research literature. Domain adaptation techniques, such as continued training on specialized datasets, can help mitigate these issues.

  4. Clinical text in EHRs is often redundant and contains irrelevant information. The authors aim to develop approaches that automatically select or attend to the most relevant clinical information without requiring annotation.

  5. Efficiency is crucial when utilizing LLMs in clinical settings with limited compute resources and the need for transparency. The authors seek to develop tractable LLM pipelines that enable fine-tuning on downstream tasks.

  6. Related work has focused on using structured EHR data for patient embeddings and long sequence transformer-based approaches for predicting clinical outcomes, but few have investigated mental health-specific EHR data.
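The cost argument in point 2 can be made concrete with a rough operation count. This is a sketch under stated assumptions: the symbols $n$ (tokens in the sequence), $d$ (hidden size), and $s$ (a hypothetical fixed segment length) are introduced here for illustration and do not appear in the summary.

```latex
\begin{align*}
\text{full self-attention over } n \text{ tokens:}\quad
  & O\!\left(n^{2} d\right) \\
\text{self-attention within } \lceil n/s \rceil \text{ segments of length } s:\quad
  & O\!\left(\lceil n/s \rceil \, s^{2} d\right) \approx O\!\left(n s d\right)
\end{align*}
```

For a fixed $s$, the segmented cost grows linearly in $n$ rather than quadratically, which is the usual motivation for splitting long inputs instead of attending over them in one pass.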

The desiderata for LLM-assisted triage include end-to-end ingestion of unstructured clinical text, resource efficiency, model interpretability, and the ability to train at scale without expert annotation.


The study used electronic health record data from approximately 200,000 patients spanning over a decade from Oxford Health NHS Foundation Trust. The data included narrative clinical notes and structured information related to referrals and discharges. A heuristic rule was developed to determine if a referral was accepted by a team based on the referral date and a 14-day cut-off. The study focused on five sub-specialty teams: eating disorders, mental health for people with intellectual disability, older-adults, early intervention for psychosis, and peri-natal psychiatry.

The study compared three approaches for handling variable token sequence lengths in the clinical notes:

A. Document-level 'brute force' approach: Each document is tokenized, truncated, and passed to the LLM for classification. The recommended triage team is determined by a majority vote over the per-document predictions.

B. Instance-level single concatenated sequence approach: Documents are concatenated, tokenized, and truncated before being passed to the LLM. The recommended team is determined by the classification head.

C. Instance-level segment-and-batch approach: Documents are concatenated, tokenized, and divided into fixed-size segments. The segments are processed as a batch by the LLM, and the recommended team is determined by the classification head.
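The sequence handling behind the three approaches can be sketched in plain Python. This is an illustrative sketch, not the authors' code: the function names, `seg_len`, and `pad_id` are assumptions, and a real pipeline would use the model's tokenizer and add its special tokens.

```python
from collections import Counter
from typing import List, Sequence


def truncate(token_ids: Sequence[int], max_len: int) -> List[int]:
    """Approaches A and B: keep only the first max_len tokens."""
    return list(token_ids[:max_len])


def segment(token_ids: Sequence[int], seg_len: int, pad_id: int = 0) -> List[List[int]]:
    """Approach C: split a concatenated sequence into fixed-size segments.

    The final segment is right-padded with pad_id (a hypothetical padding id)
    so that all segments share one shape and can be stacked into a batch.
    """
    segments = [list(token_ids[i:i + seg_len]) for i in range(0, len(token_ids), seg_len)]
    if not segments:
        segments = [[]]  # an empty instance still yields one (fully padded) segment
    segments[-1] += [pad_id] * (seg_len - len(segments[-1]))
    return segments


def majority_vote(predictions: Sequence[str]) -> str:
    """Approach A: combine per-document predictions into one recommendation.

    Ties are resolved in favour of the label seen first.
    """
    return Counter(predictions).most_common(1)[0][0]
```

Truncation discards everything past `max_len`, whereas segmentation keeps every token, which is one plausible reason for the difference between approaches B and C on long instances.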

The study also utilized Low-Rank-Adaptation (LoRA) to improve training efficiency by reducing the number of trainable parameters during fine-tuning. The project was reviewed and approved by the Oversight Committee of the Oxford Health NHS Foundation Trust and the Research and Development Team.

Implementation details

Data Preprocessing and Model Training Details

The authors performed minimal data cleaning for language modeling with transformer-based models. Preprocessing steps included removing extra whitespace, carriage returns, tabs, and poorly encoded characters. Acronyms and jargon were intentionally kept to encourage the language models to learn the noise in clinical text.
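A minimal sketch of this kind of cleaning, assuming that "poorly encoded characters" means the Unicode replacement character and stray control characters. The function name `clean_note` and these exact rules are assumptions, not the authors' implementation.

```python
import re
import unicodedata


def clean_note(text: str) -> str:
    """Minimal cleaning: drop badly encoded characters and collapse whitespace."""
    # U+FFFD is the character decoders emit for bytes they cannot decode.
    text = text.replace("\ufffd", "")
    # Remove control characters, but keep newline, carriage return, and tab
    # so the next step can turn them into ordinary spaces.
    text = "".join(
        ch for ch in text
        if ch in "\n\r\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse any run of whitespace (including \r, \n, \t) into a single space.
    return re.sub(r"\s+", " ", text).strip()
```

Acronyms and jargon pass through untouched, in keeping with the choice to leave clinical shorthand in the text.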

Training, inference, and evaluation were performed on a single NVIDIA Tesla T4 16 GB GPU on a private AWS instance. The data was split based on unique patient identifiers to prevent data leakage. Sample numbers for the datasets are provided in Table 2.
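Splitting on patient identifiers rather than on individual notes could look like the following sketch. The record layout (a dict with a `patient_id` key), the `test_fraction` default, and the function name are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple


def split_by_patient(
    records: Sequence[Dict],
    test_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[Dict], List[Dict]]:
    """Assign all records of a given patient to the same split."""
    # Sort first so the shuffle is reproducible for a given seed.
    patient_ids = sorted({r["patient_id"] for r in records})
    random.Random(seed).shuffle(patient_ids)
    n_test = int(len(patient_ids) * test_fraction)
    test_ids = set(patient_ids[:n_test])
    train = [r for r in records if r["patient_id"] not in test_ids]
    test = [r for r in records if r["patient_id"] in test_ids]
    return train, test
```

Because the split is drawn over patients, no patient's notes can appear in both sets, which is the leakage the paragraph above guards against.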

Different modeling approaches, including Brute force, Concat truncated, Concat Longformer, and Segment-batch, were used. These approaches have varying compute requirements, so hyperparameters could not be fully aligned during training. The hyperparameters used for each approach are summarized in Table 3.


The paper compares three sequence representation methods for handling variable document and instance lengths in electronic health record (EHR) data. The methods are evaluated using RoBERTa models, either fine-tuned on OHFT data or without fine-tuning. The "segment-and-batch" method consistently performs best, and using LoRA for training efficiency incurs only a small performance degradation.

The paper also examines the relationship between instance length and classification performance. Instances are stratified into short, medium, long, and extra long sequences based on token counts. The results show that longer instances lead to better classifier performance for all three methods, with the "segment-and-batch" approach being consistently as good as or better than the other methods across all sequence lengths.


The paper describes the development of a resource-efficient large language model (LLM) for triaging patient referrals to secondary care mental health teams using electronic health record (EHR) data. The segment-and-batch approach showed the best performance while providing the desired properties of an end-to-end system. The authors propose that their approach emulates the processes used by clinical teams in NHS community out-patient settings.

The segment-and-batch method enables consistent performance over a broad range of token sequence lengths and gracefully handles short or sparse clinical documents. The model utilizes a combination of administrative data and a heuristic based on expected clinical team behavior to deliver a training target for classifying instances.

Using LoRA and the segment-and-batch architecture, training and inference can be managed on a single GPU. The model also allows for interrogation using "interpretability through presentation." An additional benefit of using LoRA with an LLM is the potential for re-using a pre-trained base LLM as a general-purpose encoder for deriving embeddings for similar applications in mental healthcare.

Limitations of the study include the intentional use of smaller LLMs due to computational resource constraints, the need for further validation of the triage assistance system's alignment with clinical practice, and the inability to share the underlying models due to data privacy concerns.

Future work should focus on testing the acceptability and utility of the "interpretability by presentation" model with clinicians, and exploring the implementation of an ensemble of triage 'agents' specializing in detecting and representing signal for specific teams. However, the ensemble approach raises ethical questions regarding the potential reflection of biased and inequitable triaging behavior.


Acknowledgment

The authors acknowledge the work and support provided by the Oxford Research Informatics Team. The team includes Tanya Smith, the Research Informatics Manager, Adam Pill and Suzanne Fisher, who are Research Informatics Systems Analysts, and Lulu Kane, the Research Informatics Administrator.


The authors acknowledge funding support from various sources. NT was supported by the EPSRC Center for Doctoral Training in Health Data Science. AK, ANH, IL, and DWJ received partial support from the NIHR AI Award for Health and Social Care. DWJ also received support from an NIHR Infrastructure Programme. AC was supported by the NIHR Oxford Cognitive Health Clinical Research Facility, an NIHR Research Professorship, the NIHR Oxford and Thames Valley Applied Research Collaboration, the NIHR Oxford Health Biomedical Research Centre, and the Wellcome Trust. The views expressed are those of the authors and not necessarily those of the UK National Health Service, the NIHR, or the UK Department of Health.


The authors N.T., D.W.J., A.K., and A.N.H. conceptualized the research. N.T. and D.W.J. curated the datasets and N.T. developed the code for pre-processing, running experiments, and analysis. N.T. and D.W.J. wrote the initial draft of the manuscript, which was then revised and edited by A.K., A.N.H., I.L., and A.C. All authors reviewed and approved the final manuscript.

Appendix A Triage and Team Referral Bouncing

The paper discusses the common pathway for patients to receive specialist secondary care through referrals from primary care physicians. It explains the concept of "referral bouncing," where a referral is forwarded from one team to another due to service pressures or arbitrary application of referral criteria. The authors acknowledge that there can be legitimate clinical reasons for referral forwarding, such as when a team functions as a single-point-of-access for a geographical region in addition to their assessment/treatment function.

The paper presents a descriptive analysis of initial first- and second-referral patterns in their dataset, using community mental health teams (CMHTs) as an example. CMHTs often receive referrals for patients in crisis, which they then refer to crisis resolution and home treatment (CRHTT) teams. The asymmetry in referral patterns is also discussed, with the example of patients referred to Early Intervention in Psychosis (EIP) teams having a high probability of being referred to CMHTs or CRHTTs.

The authors emphasize that using language models and classification to assist in triage is an inductive learning problem using a discriminative model. The electronic health record (EHR) data does not explicitly describe the clinical reasoning behind accepting or rejecting a referral, making it challenging to determine the reasons for referral bouncing in specific instances.

Figure 4: Tabulation of the probability of the team first receiving a referral (Team A, rows) referring the patient onto another team (Team B, columns) within a 30-day window.
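A tabulation of this kind could be computed as in the sketch below. It assumes each referral has already been reduced to a pair `(team_a, team_b)`, where `team_b` is the team the patient was referred on to within the window, or `None` if there was no onward referral. That input format and the function name are assumptions for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple


def transition_probabilities(
    pairs: Iterable[Tuple[str, Optional[str]]],
) -> Dict[str, Dict[Optional[str], float]]:
    """Row-normalise counts of (first team, onward team) referral pairs."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for team_a, team_b in pairs:
        counts[team_a][team_b] += 1
    # Each row sums to 1: the share of Team A's referrals that went to each Team B.
    return {
        team_a: {team_b: n / sum(row.values()) for team_b, n in row.items()}
        for team_a, row in counts.items()
    }
```

Rows need not mirror columns: the probability of Team A referring on to Team B is computed independently of the reverse direction, which is how an asymmetry such as the EIP example would show up.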

Appendix B Dataset Details

The paper presents a heuristic for determining whether a referral was accepted or rejected based on the available structured data in the electronic health record (EHR) system of Oxford Health NHS Foundation Trust. The heuristic considers a referral as rejected if it is discharged within 14 days of the referral date. This "14 day rule" is supported by the distribution of referral durations, which shows a large proportion of referrals being discharged on the same day and a slight peak around 14 days.
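The rule can be written as a small labelling function. This is a sketch with two assumptions that the summary does not settle: a discharge on exactly day 14 counts as "within 14 days", and a referral with no discharge date is treated as accepted. The function and parameter names are hypothetical.

```python
from datetime import date
from typing import Optional


def referral_accepted(
    referral_date: date,
    discharge_date: Optional[date],
    cutoff_days: int = 14,
) -> bool:
    """Label a referral as rejected if it was discharged within cutoff_days."""
    if discharge_date is None:
        # Assumption: a referral that is still open counts as accepted.
        return True
    # Assumption: the cut-off is inclusive, so day 14 itself is a rejection.
    return (discharge_date - referral_date).days > cutoff_days
```

Same-day discharges, which the distribution shows to be common, fall on the rejected side of this rule.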

The paper also provides statistics on the number of tokens per individual document and per referral instance. The mean number of tokens per document is 183, with the 90th percentile at 388 tokens. For concatenated instances, the mean token count is 6,420, with the 90th percentile at 11,427 tokens. The distribution of token numbers per instance is similar for accepted and rejected referrals, with median values of 1,463 and 1,367 tokens, respectively.

Appendix C Toward Interpretable Triage Recommendations

The paper argues that providing mechanistic or intrinsic interpretability for contemporary AI systems built using 'black box' methods, such as LLMs with downstream classification tasks, may not be possible. Instead, the authors propose an approach called interpretability through presentation, which exposes key stages or steps in the computational process as graphical intuitions to allow clinicians to interrogate decisions made by the system.

The process involves:

  1. Mapping an instance to a location in a 768-dimensional embedding space
  2. Learning a mapping from the embedding space to the probability of being accepted by one of 5 sub-specialty teams

To present this process to users, the authors:

  1. Use dimensionality reduction to display a planar projection of the embedding space, providing a map of the population of instances and emphasizing clustering of similar referral instances
  2. Exploit label-aware attention weights to visually highlight instance tokens that contributed most to the classification, enabling users to inspect the 'source' information driving the triage recommendation
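The token highlighting in the second step can be illustrated with a generic form of label-aware attention: each token vector is scored against a per-label query vector, and the scores are soft-maxed over tokens. The summary does not give the paper's exact formulation, so the dot-product scoring and the names `label_attention` and `top_tokens` are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def label_attention(
    token_vectors: Sequence[Sequence[float]],
    label_query: Sequence[float],
) -> List[float]:
    """Softmax over tokens of the dot product between each token and a label query."""
    scores = [sum(h * q for h, q in zip(vec, label_query)) for vec in token_vectors]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_tokens(
    tokens: Sequence[str],
    weights: Sequence[float],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k tokens with the largest attention weight, for highlighting."""
    ranked = sorted(zip(tokens, weights), key=lambda tw: tw[1], reverse=True)
    return ranked[:k]
```

In a user interface, the weights would set the shading intensity of each token rather than being cut off at a top-k.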

The authors present a prototype user interface and show how different types of clinical notes are handled by their assisted triage model. Four examples of clinical notes are provided to illustrate the system's performance:

  1. A mental state examination implying a psychotic episode, suggesting a referral to an early intervention for psychosis (EIP) team
  2. A clinical note summarizing a patient's history and clinical review, also suggesting an EIP team referral
  3. Three short summative notes from different healthcare professionals, strongly implying a referral to an older adult team
  4. A brief note with evidence of historical care from a learning disability team but current focus on occupational therapy and frailty, suggesting a referral to an older adult team

The model outputs for each example include:

  1. The clinical note with highlighting proportional to the model's label-aware attention weights
  2. A planar projection of the embeddings of all patients' instances, indicating the relative location of the query instance

Figure 7: Mental State Exam (MSE). (A) Visualisation of label-aware attention applied to the original synthetic text, where darker blue indicates higher soft-maxed attention scores. (B) Planar projection (via t-SNE) of the training data set instance embeddings, with the query instance shown as a red cross.

Figure 8: Summary note from an MDT meeting or discussion. (A) Visualisation of label-aware attention applied to the original synthetic text, where darker blue indicates higher soft-maxed attention scores. (B) Planar projection (via t-SNE) of the training data set instance embeddings, with the query instance shown as a red cross.

Figure 9: A short instance summarising a patient from different healthcare professionals. (A) Visualisation of label-aware attention applied to the original synthetic text, where darker blue indicates higher soft-maxed attention scores. (B) Planar projection (via t-SNE) of the training data set instance embeddings, with the query instance shown as a red cross.

Figure 10: An administrative note highlighting previous history with one team (learning disability) but where the content reflects needs appropriate to a different (older adult) team. (A) Visualisation of label-aware attention applied to the original synthetic text, where darker blue indicates higher soft-maxed attention scores. (B) Planar projection (via t-SNE) of the training data set instance embeddings, with the query instance shown as a red cross.

The model demonstrates the ability to identify important information for classifying triage team decisions. The examples show the model highlighting relevant details for specific teams, such as signs of schizophrenia for the Early Intervention in Psychosis (EIP) team and memory problems and next steps for the older adult community mental health team (oaCMHT). The model's attention focuses on the most salient parts of the notes, emphasizing the key information needed for accurate team assignment.

Appendix D LoRA

LoRA (Low-Rank Adaptation) is a reparameterization technique for efficiently adapting large language models (LLMs). It approximates the update to a weight matrix in the LLM with the product of two trainable matrices, A and B, which acts as a low-rank approximation of that update, analogous to a truncated singular value decomposition (SVD). The rank of the LoRA matrices is a tunable parameter, typically much smaller than the dimensions of the original weight matrices.

In the forward pass, the product of the two matrices is added to the original weights. LoRA is commonly applied to the key, query, and value matrices in the transformer architecture, on the assumption that weight updates in LLMs have a low intrinsic rank relative to the dimensions of the matrices.

Once trained, the LoRA matrices can be integrated into the model, resulting in no additional inference latency. During training, the original weight matrices of the LLM remain frozen, making LoRA an efficient training method for adapting large language models to new tasks or domains.
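A minimal sketch of the update described in this appendix, using tiny hand-written matrices in pure Python. The helper names (`matmul`, `lora_forward`, `merge`, `trainable_params`) are illustrative, and the scaling factor that LoRA implementations usually apply to the update is omitted for brevity.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def lora_forward(w0: Matrix, b: Matrix, a: Matrix, x: Matrix) -> Matrix:
    """Training-time forward pass: frozen W0 x plus the low-rank path B (A x)."""
    return add(matmul(w0, x), matmul(b, matmul(a, x)))


def merge(w0: Matrix, b: Matrix, a: Matrix) -> Matrix:
    """After training: fold the update into a single matrix W0 + B A."""
    return add(w0, matmul(b, a))


def trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameter count of B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)
```

Multiplying an input by `merge(w0, b, a)` gives the same result as `lora_forward`, which is why merging adds no inference latency. For a 768 x 768 matrix (the embedding dimension mentioned in Appendix C) and a hypothetical rank of 8, `trainable_params(768, 768, 8)` is 12,288, against 589,824 entries in the full matrix.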

