arXiv submission date: 2026-03-12
📄 Abstract - LLMs can construct powerful representations and streamline sample-efficient supervised learning

As real-world datasets become increasingly complex and heterogeneous, supervised learning is often bottlenecked by input representation design. Modeling multimodal data, such as time-series, free text, and structured records, for downstream tasks often requires non-trivial domain-specific engineering. We propose an agentic pipeline to streamline this process. First, an LLM analyzes a small but diverse subset of text-serialized input examples in-context to synthesize a global rubric, which acts as a programmatic specification for extracting and organizing evidence. This rubric is then used to transform naive text serializations of inputs into a more standardized format for downstream models. We also describe local rubrics, which are task-conditioned summaries generated by an LLM. Across 15 clinical tasks from the EHRSHOT benchmark, our rubric-based approaches significantly outperform traditional count-feature models, naive text-serialization LLM baselines, and a clinical foundation model pretrained on orders of magnitude more data. Beyond performance, rubrics offer several advantages in operational healthcare settings: they are easy to audit, cost-effective to deploy at scale, and can be converted to tabular representations that unlock a swath of machine learning techniques.
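The two-step pipeline the abstract describes (synthesize a global rubric from a few examples, then use it to standardize every input) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the LLM call is stubbed out, and the rubric fields and keyword-style extraction are illustrative assumptions standing in for LLM-driven evidence extraction.

```python
def synthesize_global_rubric(sample_serializations, llm=None):
    """Step 1: an LLM reads a small, diverse subset of text-serialized
    inputs in-context and emits a rubric -- a specification of which
    evidence to extract and how to organize it. Stubbed here."""
    if llm is None:
        # Assumed fields for illustration; a real LLM would propose these.
        return ["diagnosis", "medication", "recent_lab"]
    return llm(sample_serializations)


def apply_rubric(serialization, rubric):
    """Step 2: transform a naive text serialization into a standardized
    record keyed by the rubric's fields (crude prefix matching stands in
    for LLM-driven extraction)."""
    return {
        field: [line for line in serialization.splitlines()
                if line.lower().startswith(field)]
        for field in rubric
    }


# Usage: standardize two toy "patient records" with one shared rubric.
records = [
    "diagnosis: type 2 diabetes\nmedication: metformin",
    "medication: lisinopril\nrecent_lab: HbA1c 7.2",
]
rubric = synthesize_global_rubric(records[:1])
standardized = [apply_rubric(r, rubric) for r in records]
```

Because every input is mapped onto the same rubric fields, the resulting records are directly convertible to a tabular representation, which is the property the abstract highlights for downstream machine learning.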

Top tags: llm medical model training
Detailed tags: representation learning clinical nlp data serialization ehr analysis sample efficiency

LLMs can construct powerful representations and streamline sample-efficient supervised learning


1️⃣ One-sentence summary

This paper proposes a method that uses large language models to automatically generate "rubrics" that transform complex raw data (such as medical records) into standardized, effective representations, significantly improving the performance of downstream supervised models on clinical tasks with few samples, while remaining easy to audit and cheap to deploy.

Source: arXiv: 2603.11679