arXiv submission date: 2026-05-18
📄 Abstract - Memisis: Orchestrating and Evaluating Synthetic Data for Tabular Health Datasets

Synthetic data is widely used in healthcare to create datasets that are similar to original data but without the privacy concerns. Generating and evaluating synthetic data across privacy, utility and fairness is crucial for facilitating high quality data availability for downstream prediction tasks and clinical decision making. We present Memisis, a tool that orchestrates and evaluates synthetic data by leveraging existing synthetic data tools, the power of large language models and state-of-the-art evaluation metrics. Our tool creates a unified workflow for data generation, validation and evaluation. Users have control over the training size, training epochs and the number of synthetic rows to sample. Instead of knobs to tune synthetic data, the interactive agent allows users to specify their synthetic data generation goals and the tool will orchestrate the workflow by leveraging existing tools while performing the requisite evaluation. For the demo, we use an open source schizophrenia dataset with protected attributes related to race and gender, three different synthesizers and a local language model to orchestrate the workflow. We observe that CTGAN, TVAE and GaussianCopula have comparable performance across fairness and utility metrics. The workflow allows users flexibility and control over the data generation and evaluation process.
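As an illustration of what evaluating a synthetic table on utility and fairness can look like, here is a minimal sketch in plain Python. It is not the Memisis implementation: the abstract does not name its metrics, so total variation distance on a column marginal (utility) and a demographic parity gap (fairness) are assumed stand-ins, and the function names, column names, and toy rows are all hypothetical.

```python
from collections import Counter


def marginal_tvd(real_rows, synth_rows, column):
    """Total variation distance between real and synthetic marginals of one column.

    0.0 means the synthetic marginal matches the real one exactly; 1.0 means no overlap.
    """
    real = Counter(row[column] for row in real_rows)
    synth = Counter(row[column] for row in synth_rows)
    categories = set(real) | set(synth)
    return 0.5 * sum(
        abs(real[c] / len(real_rows) - synth[c] / len(synth_rows))
        for c in categories
    )


def demographic_parity_gap(rows, protected, outcome, positive=1):
    """Largest difference in positive-outcome rate between any two protected groups."""
    totals, positives = Counter(), Counter()
    for row in rows:
        totals[row[protected]] += 1
        positives[row[protected]] += int(row[outcome] == positive)
    rates = [positives[group] / totals[group] for group in totals]
    return max(rates) - min(rates)


# Toy rows standing in for a real table and one synthesizer's output (hypothetical data).
real = [
    {"gender": "F", "diagnosis": 1},
    {"gender": "F", "diagnosis": 0},
    {"gender": "M", "diagnosis": 1},
    {"gender": "M", "diagnosis": 1},
]
synthetic = [
    {"gender": "F", "diagnosis": 1},
    {"gender": "F", "diagnosis": 1},
    {"gender": "F", "diagnosis": 0},
    {"gender": "M", "diagnosis": 1},
]

report = {
    "utility_tvd_gender": marginal_tvd(real, synthetic, "gender"),
    "fairness_gap_real": demographic_parity_gap(real, "gender", "diagnosis"),
    "fairness_gap_synthetic": demographic_parity_gap(synthetic, "gender", "diagnosis"),
}
print(report)
```

Running the same report once per synthesizer (for example CTGAN, TVAE and GaussianCopula outputs) would give a side-by-side comparison of the kind the abstract describes, under the assumed metrics.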

Top-level tags: llm medical data
Detailed tags: synthetic data tabular data evaluation healthcare fairness

Memisis: Orchestrating and Evaluating Synthetic Data for Tabular Health Datasets


1️⃣ One-sentence summary

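This paper introduces Memisis, a tool that integrates existing synthetic data generation tools with large language models, helping healthcare users generate and evaluate synthetic health data more flexibly and controllably while protecting privacy and balancing fairness and utility.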

Source: arXiv: 2605.17758