Finding Needles in the Haystack: Transductive Active Labeling in Ecology
1️⃣ One-sentence summary
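This paper argues that the active learning methods commonly used to label ecological data are misaligned with the real task objective, and proposes a shift to transductive labeling: efficiently labeling the entire data pool rather than focusing solely on predictive accuracy. It targets the "discovery" problem for rare classes (such as rare species) and designs a hybrid stopping strategy that incorporates discovery criteria to avoid stopping labeling prematurely, thereby markedly improving rare-class recovery.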
Active learning is now standard practice in labeling ecological data, enabling ecologists to quickly process large volumes of field data to understand and monitor natural environments. Current practices evaluate active learning inductively, estimating predictive performance on a held-out test set. We argue that this evaluation is misaligned with most ecological tasks, where the goal is to transductively label an entire pool of data as efficiently as possible. We demonstrate that ignoring the human-in-the-loop underestimates the importance of continuing to label, particularly for classes in the long tail which may be of disproportionate ecological importance (rare species, uncommon behaviors, etc.). Our analysis shows that, for this long tail, the transductive objective shifts importance from prediction to discovery: the true challenge becomes finding "needles in the haystack," examples of rare classes that are embedded within dense regions of abundant classes in the latent geometry, which we quantify with a novel metric of sampling difficulty. Finally, to translate these insights to practical ecological workflows, we propose a conservative hybrid stopping criterion inspired by ecological rarefaction curves, and show that combining predictive performance with discovery criteria reduces premature stopping on long-tailed pools, improving rare-class recovery when discovery, not classification, is the limiting factor.
Source: arXiv: 2606.03821