arXiv submission date: 2025-12-19
📄 Abstract - UCoder: Unsupervised Code Generation by Internal Probing of Large Language Models

Large language models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, their effectiveness relies heavily on supervised training with extensive labeled data (e.g., question-answer pairs) or unlabeled datasets (e.g., code snippets), which are often expensive and difficult to obtain at scale. To address this limitation, this paper introduces IPC, an unsupervised framework that leverages Internal Probing of LLMs for Code generation without any external corpus, not even unlabeled code snippets. IPC combines problem-space probing, test-understanding probing, solution-space probing, and knowledge consolidation and reinforcement to surface the internal knowledge and confidence patterns already present in LLMs. It then identifies reliable code candidates through self-consistency mechanisms and representation-based quality estimation, and uses them to train UCoder (a coder trained with unsupervised learning). We validate the proposed approach across multiple code benchmarks, demonstrating that unsupervised methods can achieve performance competitive with supervised approaches while significantly reducing the dependency on labeled data and computational resources. Analytic experiments reveal that internal model states contain rich signals about code quality and correctness, and that properly harnessing these signals enables effective unsupervised learning for code generation, opening new directions for training code LLMs in resource-constrained scenarios.
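The abstract's self-consistency idea can be illustrated with a minimal sketch: sample several candidate solutions, probe each on the same inputs, and keep a candidate from the largest cluster of agreeing behaviors. This is a generic majority-vote illustration, not the paper's actual implementation; the function name `select_by_self_consistency` and the toy candidates are assumptions for demonstration.

```python
from collections import Counter

def select_by_self_consistency(candidates, test_inputs):
    """Pick a candidate whose behavior agrees with the most others.

    candidates: list of callables (sampled solutions to the same problem)
    test_inputs: list of argument tuples used to probe behavior
    """
    signatures = []
    for fn in candidates:
        outputs = []
        for args in test_inputs:
            try:
                outputs.append(repr(fn(*args)))
            except Exception as exc:
                outputs.append(f"error:{type(exc).__name__}")
        signatures.append(tuple(outputs))

    # Majority vote over behavioral signatures: the largest cluster
    # of mutually agreeing candidates is treated as most reliable.
    majority_sig, _ = Counter(signatures).most_common(1)[0]
    return next(fn for fn, sig in zip(candidates, signatures)
                if sig == majority_sig)

# Toy demo: three sampled "solutions" for absolute value,
# one of which is buggy; the majority cluster wins.
cand_a = lambda x: abs(x)
cand_b = lambda x: x if x >= 0 else -x
cand_c = lambda x: x  # buggy sample
best = select_by_self_consistency([cand_a, cand_b, cand_c],
                                  [(3,), (-4,), (0,)])
```

In IPC this kind of agreement signal is combined with representation-based quality estimation; the sketch above covers only the voting half.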

Top-level tags: llm, model training, systems
Detailed tags: unsupervised learning, code generation, internal probing, self-consistency, representation-based estimation

UCoder: Unsupervised Code Generation by Internal Probing of Large Language Models


1️⃣ One-sentence summary

This paper proposes IPC, an unsupervised framework that probes the internal knowledge and confidence patterns of large language models to self-generate high-quality code without relying on any external code data, achieving performance comparable to supervised methods while reducing dependence on labeled data and computational resources.

Source: arXiv 2512.17385