arXiv submission date: 2026-06-25
📄 Abstract - A Pipeline for Generating Longitudinal Synthetic Clinical Notes Using Large Language Models

Synthetic data is increasingly used to enable the development and evaluation of AI systems in domains where access to real-world data is restricted. In healthcare, clinical documentation presents particular challenges due to its sensitivity. This work introduces a synthetic clinical notes pipeline and dataset designed to support the development of clinical AI tools while avoiding the privacy risks associated with real patient data. The dataset is generated using a modular pipeline that combines structured patient generation, semi-structured patient journey simulation, and unstructured clinical note generation using large language models. The pipeline is designed to prioritise internal consistency across longitudinal patient records, while also capturing variation in writing style, note structure, and clinical detail. Additional mechanisms, including LLM-based validation and augmentation steps, are used to improve faithfulness, realism, and diversity of the generated notes. We release a dataset of 70 synthetic patients, each associated with 20-50 clinical notes spanning a full hospital journey. The dataset is provided at multiple levels of validation, enabling users to balance realism and scalability depending on their use case. This dataset supports the development, testing, and evaluation of clinical AI systems, including summarisation tools, coding models, and decision support systems, without reliance on real patient data.
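The three-stage structure described in the abstract (structured patient generation, patient journey simulation, LLM-based note writing, followed by validation) can be illustrated with a minimal sketch. This is not the authors' implementation: every name below (`Patient`, `Event`, `Note`, `simulate_journey`, `validate_note`, `stub_llm`, the condition and note-type lists) is hypothetical, the LLM is replaced by a stub callable, and the validation is a simple string check rather than the LLM-based validation the paper mentions. Only the 20-50 notes-per-patient range is taken from the abstract.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Illustrative values only; the paper's actual vocabularies are not given here.
CONDITIONS = ["pneumonia", "heart failure", "hip fracture"]
MIDDLE_KINDS = ["progress", "nursing", "consult"]


@dataclass(frozen=True)
class Patient:
    patient_id: str
    age: int
    sex: str
    primary_condition: str


@dataclass(frozen=True)
class Event:
    day: int
    kind: str


@dataclass(frozen=True)
class Note:
    patient_id: str
    day: int
    kind: str
    text: str


def generate_patient(rng: random.Random, idx: int) -> Patient:
    """Stage 1: structured patient record."""
    return Patient(f"P{idx:03d}", rng.randint(18, 95),
                   rng.choice(["F", "M"]), rng.choice(CONDITIONS))


def simulate_journey(rng: random.Random) -> List[Event]:
    """Stage 2: ordered events from admission to discharge."""
    n_notes = rng.randint(20, 50)  # range quoted in the abstract
    stay = rng.randint(3, 30)      # assumed length of stay in days
    middle_days = sorted(rng.randint(0, stay) for _ in range(n_notes - 2))
    events = [Event(0, "admission")]
    events += [Event(day, rng.choice(MIDDLE_KINDS)) for day in middle_days]
    events.append(Event(stay, "discharge"))
    return events


def stub_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; it simply echoes the prompt."""
    return "NOTE: " + prompt


def write_note(patient: Patient, event: Event, llm: Callable[[str], str]) -> Note:
    """Stage 3: free-text note conditioned on the structured facts."""
    prompt = (f"{event.kind} note, day {event.day}: {patient.age}-year-old "
              f"{patient.sex} patient with {patient.primary_condition}.")
    return Note(patient.patient_id, event.day, event.kind, llm(prompt))


def validate_note(patient: Patient, note: Note) -> bool:
    """Consistency check: the note must restate the patient's key facts."""
    return (patient.primary_condition in note.text
            and f"{patient.age}-year-old" in note.text)


def run_pipeline(n_patients: int, seed: int,
                 llm: Callable[[str], str] = stub_llm
                 ) -> List[Tuple[Patient, List[Note]]]:
    """Run all stages and keep only notes that pass validation."""
    rng = random.Random(seed)
    records = []
    for idx in range(n_patients):
        patient = generate_patient(rng, idx)
        notes = [write_note(patient, ev, llm) for ev in simulate_journey(rng)]
        records.append((patient, [n for n in notes if validate_note(patient, n)]))
    return records
```

Calling `run_pipeline(70, seed=0)` would produce 70 synthetic patients under these assumptions. Passing the LLM in as a callable keeps the sketch runnable offline and makes the validation step easy to exercise with a deliberately unfaithful generator.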

Top-level tags: medical llm data
Detailed tags: synthetic data clinical notes longitudinal records data generation validation

A Pipeline for Generating Longitudinal Synthetic Clinical Notes Using Large Language Models


1️⃣ One-sentence summary

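This paper presents a modular pipeline that uses large language models to generate synthetic clinical notes, producing simulated patient records that are internally consistent over time. The approach avoids the privacy risks of real patient data while supplying diverse, validated data for developing and testing clinical AI tools such as summarisation, coding, and decision support systems.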

Source: arXiv: 2606.26879