Artificial intelligence outperforms doctors in summarizing health records, study shows

In a recent study published in the journal Nature Medicine, an international team of researchers identified the best-performing large language models and adaptation methods for summarizing large amounts of clinical data from electronic health records, and compared the performance of these models with that of medical experts.

Study: Adapted large language models can outperform medical experts in clinical text summarization. Image credit: takasu / Shutterstock

Background

A labor-intensive but essential aspect of medical practice is the documentation of patients’ health records, which contain progress notes, diagnostic test results, and specialist treatment histories. Clinicians often spend considerable time compiling vast amounts of textual data, and even for highly experienced physicians, this process presents an opportunity for errors to be introduced, which can lead to serious medical and diagnostic problems.

The transition from paper records to electronic health records appears to have only expanded the clinical documentation workload: reports suggest that clinicians spend approximately two hours documenting a single patient interaction, and nurses spend nearly 60% of their time on clinical documentation. These time demands often lead to significant stress and burnout, reducing job satisfaction among clinicians and ultimately leading to poorer patient outcomes.

Although large language models represent an excellent opportunity for summarizing clinical data, and these models have been evaluated on common natural language processing tasks, their performance and accuracy in summarizing clinical data have not been widely evaluated.

About the research

In the current study, researchers assessed eight large language models on four clinical summarization tasks: patient questions, radiology reports, doctor-patient dialogue, and progress notes.

They first used quantitative natural language processing metrics to determine which model and adaptation method performed best across the four summarization tasks. Ten physicians then conducted a clinical reader study in which they compared the best large language model summaries with those written by medical experts on conciseness, correctness, and completeness.
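
The article does not name the specific metrics used, but overlap-based scores such as ROUGE are standard quantitative measures for summarization. A minimal sketch, assuming the rouge_score Python package and illustrative summary strings:

```python
# A minimal sketch of quantitative summary evaluation using ROUGE, a standard
# overlap metric for summarization. The metric choice and the example strings
# are assumptions; the article does not specify which metrics the study used.
from rouge_score import rouge_scorer

expert_summary = "Chest X-ray shows no acute cardiopulmonary abnormality."
model_summary = "No acute cardiopulmonary abnormality on chest X-ray."

# Score unigram overlap (ROUGE-1) and longest common subsequence (ROUGE-L).
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(expert_summary, model_summary)

for name, score in scores.items():
    print(f"{name}: precision={score.precision:.2f}, "
          f"recall={score.recall:.2f}, f1={score.fmeasure:.2f}")
```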

Finally, the researchers evaluated safety aspects, examining challenges such as fabricated information and the potential for medical harm in summaries produced by both medical experts and large language models.

The eight large language models spanned two broad approaches to language generation: autoregressive and sequence-to-sequence (seq2seq) models. Training seq2seq models requires paired datasets because they use an encoder-decoder architecture that maps an input to an output. These models work effectively for tasks such as summarization and machine translation.

Autoregressive models, on the other hand, do not require paired datasets and are suited to tasks such as dialogue, question answering, and free-text generation. The study evaluated open-source autoregressive and seq2seq large language models, as well as several proprietary autoregressive models, along with two techniques for adapting pre-trained general-purpose large language models to domain-specific tasks.
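
For readers unfamiliar with the two model families, the sketch below contrasts them using the Hugging Face transformers library; the model names (t5-small, gpt2) are illustrative placeholders, not the models evaluated in the study:

```python
# Contrast of the two architecture families described above, using the
# Hugging Face transformers library. Model names are illustrative only.
from transformers import pipeline

clinical_note = ("Patient presents with chest pain radiating to the left arm, "
                 "onset two hours ago, accompanied by shortness of breath.")

# seq2seq (encoder-decoder): trained on paired input -> summary examples,
# so the summarization task is baked into the model weights.
seq2seq = pipeline("summarization", model="t5-small")
print(seq2seq(clinical_note, max_length=25)[0]["summary_text"])

# Autoregressive (decoder-only): no paired data required; the task is
# specified through the prompt instead.
causal = pipeline("text-generation", model="gpt2")
prompt = f"Summarize the note in one sentence.\nNote: {clinical_note}\nSummary:"
print(causal(prompt, max_new_tokens=25)[0]["generated_text"])
```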

The four task domains used to evaluate the large language models were: summarizing radiology reports from detailed findings and results, condensing patient questions into brief queries, generating a list of medical problems and diagnoses from progress notes, and summarizing doctor-patient dialogues into a paragraph covering the assessment and plan.

Results

The results show that 45% of the summaries from the best adapted large language models were judged equivalent to those from medical experts, and 36% were judged superior. Additionally, in the clinical reader study, the large language model summaries scored higher than the medical expert summaries on all three parameters: conciseness, correctness, and completeness.

In addition, the researchers found that prompt engineering, the process of tuning or modifying input prompts, significantly improved model performance. This was particularly evident for conciseness: prompts instructing the model to summarize a patient’s question within a specific word count helped meaningfully condense the information.
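
As an illustration of the kind of word-count instruction the article describes (the study’s actual prompt wording is not reproduced here), such a prompt template might look like this:

```python
# Hypothetical prompt template illustrating the word-count constraint the
# article describes; the function name and wording are assumptions.
def build_prompt(patient_question: str, max_words: int = 10) -> str:
    return (
        f"Summarize the patient's question in at most {max_words} words, "
        "preserving the key clinical concern.\n\n"
        f"Question: {patient_question}\n"
        "Summary:"
    )

print(build_prompt(
    "I've been taking lisinopril for two weeks and now have a dry cough "
    "that won't go away. Should I stop the medication or see my doctor?"
))
```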

Radiology reports were the only task where the large language models’ summaries were less concise than the medical experts’, which the researchers attributed to ambiguity in the input prompt, as the radiology summarization prompts did not specify a word limit. They also suggested that incorporating checks from other large language models or model ensembles, as well as human oversight, could greatly improve the accuracy of the process.

Conclusions

Overall, the study found that large language models summarizing data from patient health records performed as well as or better than medical experts. The best adapted models scored higher than human experts on natural language processing metrics, summarizing the data concisely, correctly, and completely. With further modifications and improvements, this approach could help clinicians save valuable time and improve patient care.

Journal reference:

  • Van Veen, D., Van Uden, C., Blankemeier, L., Delbrouck, J.-B., Aali, A., Bluethgen, C., Pareek, A., Polacin, M., Reis, E. P., Seehofnerová, A., Rohatgi, N., Hosamani, P., Collins, W., Ahuja, N., Langlotz, C. P., Hom, J., Gatidis, S., Pauly, J., & Chaudhari, A. S. (2024). Adapted large language models can outperform medical experts in clinical text summarization. Nature Medicine. DOI: 10.1038/s41591-024-02855-5, https://www.nature.com/articles/s41591-024-02855-5
