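High-quality data augmentation for code comment classification
1️⃣ One-sentence summary
This paper proposes Q-SYNTH, a high-quality data augmentation technique that generates high-quality synthetic data to address the small size and class imbalance of datasets for code comment classification, improving the base classifier's performance by 2.56%.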
Code comments play a crucial role in software development: they document functionality, clarify design choices, and assist with issue tracking. They capture developers' insights about the surrounding source code and are an essential resource for both human comprehension and automated analysis. Nevertheless, because comments are written in natural language, they pose challenges for machine-based code understanding. To address this, recent studies have applied natural language processing (NLP) and deep learning techniques to classify comments according to developers' intentions. However, existing datasets for this task are small and class-imbalanced: they rely on manual annotation and may not accurately represent the distribution of comments in real-world codebases. To overcome this issue, we introduce new synthetic oversampling and augmentation techniques based on high-quality data generation to enhance the NLBSE'26 challenge datasets. Our Synthetic Quality Oversampling Technique and Augmentation Technique (Q-SYNTH) yield promising results, improving the base classifier by $2.56\%$.
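The class-imbalance problem described above can be illustrated with plain random oversampling. The sketch below is not Q-SYNTH: the abstract does not say how synthetic comments are generated or how their quality is assessed, so this only shows the baseline idea of growing minority classes until every class matches the largest one. The function name `oversample_to_balance`, the toy comments, and the label names are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def oversample_to_balance(texts, labels, seed=0):
    """Duplicate minority-class examples at random (with replacement)
    until every class has as many examples as the largest class.

    Illustrative baseline only; this is NOT the Q-SYNTH method.
    """
    rng = random.Random(seed)

    # Group the comment texts by their class label.
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    # Every class is grown to the size of the largest class.
    target = max(len(items) for items in by_label.values())

    out_texts, out_labels = [], []
    for label, items in by_label.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Hypothetical toy dataset: three "summary" comments, one "deprecation" comment.
texts = [
    "Returns the user id.",
    "Computes the checksum of the buffer.",
    "Parses the config file.",
    "Deprecated: use load_v2 instead.",
]
labels = ["summary", "summary", "summary", "deprecation"]

balanced_texts, balanced_labels = oversample_to_balance(texts, labels)
print(Counter(balanced_labels))
```

Exact duplication adds no new information and can encourage overfitting on the repeated examples, which is one motivation for generating synthetic samples instead, as the abstract proposes.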
Source: arXiv: 2601.19383