Papers
arxiv:2405.00716

Large Language Models in the Clinic: A Comprehensive Benchmark

Published on Oct 16, 2024
Authors:
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,

Abstract

A comprehensive benchmark named ClinicBench was developed to evaluate large language models on diverse clinical tasks including open-ended decision-making and long document processing, with evaluations conducted under zero-shot and few-shot settings.

AI-generated summary

The adoption of large language models (LLMs) to assist clinicians has attracted remarkable attention. Existing works mainly adopt the close-ended question-answering (QA) task with answer options for evaluation. However, many clinical decisions involve answering open-ended questions without pre-set options. To better understand LLMs in the clinic, we construct a benchmark ClinicBench. We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks. Furthermore, we construct six novel datasets and clinical tasks that are complex but common in real-world practice, e.g., open-ended decision-making, long document processing, and emerging drug analysis. We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings. Finally, we invite medical experts to evaluate the clinical usefulness of LLMs. The benchmark data is available at https://github.com/AI-in-Health/ClinicBench.

Community

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2405.00716 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2405.00716 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2405.00716 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.