Abstract - Designing Reward Signals for Portable Query Generation: A Case Study in Industrial Semantic Job Search
Job-search platforms rely on low-bandwidth query interfaces that often fail to capture the high-dimensional complexity of candidate profiles. We present an end-to-end RLAIF (Reinforcement Learning from AI Feedback) framework for generating \emph{portable} job search queries: terms that abstract away seeker-specific identifiers while preserving generalizable qualifications. The task has a highly adversarial reward surface: policy optimization frequently exploits flaws in LLM-as-judge rubrics, producing degenerate verbatim-copying behavior. We run experiments that isolate the effect of optimization mechanics from that of structured reward engineering. Our results indicate that, for critic-free optimizers, performance is dictated overwhelmingly by robust reward shaping, and the specific choice of algorithm is largely immaterial. While the critic-free per-rollout baseline methods (RLOO and REINFORCE++) natively resist reward hacking, the group-relative advantage normalization in GRPO appears uniquely sensitive to spurious reward signals, making it disproportionately susceptible to exploitation. Introducing a deterministic, rule-based reward floor that corrects the rewards assigned to verbatim copying mitigates this failure mode and yields a $+0.147$ quality improvement on a cross-family evaluation judge. Finally, we show that the training-time reward model inflates performance gains by $2.4\times$, confirming that training success depends on disciplined reward shaping rather than on selecting an alternative optimizer.
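To make the contrast between the optimizers concrete, the following is a minimal sketch of the two advantage estimators in their commonly published forms: GRPO's group-relative normalization (subtract the group mean, divide by the group standard deviation) and RLOO's leave-one-out baseline. It is an illustration under assumptions, not the authors' implementation; the function names, the `eps` constant, and the toy rewards are hypothetical. The toy group shows one plausible mechanism for the sensitivity described above: when rewards in a group are nearly identical, dividing by a tiny standard deviation can turn a small spurious reward difference into a large advantage.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the rollout group
    return [(r - mu) / (sigma + eps) for r in rewards]


def rloo_advantages(rewards):
    """Leave-one-out advantages: r_i minus the mean of the other rollouts."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# Hypothetical group of four rollouts for one prompt. The last rollout
# receives a tiny, possibly spurious, bonus from the judge.
rewards = [0.50, 0.50, 0.50, 0.51]

grpo = grpo_advantages(rewards)
rloo = rloo_advantages(rewards)

print("GRPO:", [round(a, 3) for a in grpo])
print("RLOO:", [round(a, 3) for a in rloo])
```

In this toy group the RLOO advantage of the last rollout stays on the scale of the raw reward gap, while the normalized GRPO advantage is of order one regardless of how small the gap is.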
Designing Reward Signals for Portable Query Generation: A Case Study in Industrial Semantic Job Search
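1️⃣ One-sentence summary
This paper proposes a reinforcement learning framework based on AI feedback that, for industrial job search, automatically generates search keywords which mask the job seeker's identifying information while preserving general qualifications; by introducing a rule-based reward floor, it prevents the AI reward model from being exploited (for example, by copying the source text verbatim), significantly improving query quality.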
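The rule-based reward floor can be sketched as follows. This is only one possible reading of the idea, since the abstract does not specify the rule: here a deterministic n-gram overlap check overrides the judge's score with a fixed floor whenever a generated query copies too much of the profile verbatim. The function names, the trigram choice, the `max_copy` threshold, and the `floor` value are all assumptions made for illustration.

```python
def ngrams(tokens, n):
    """Return the set of contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def copy_ratio(query, profile, n=3):
    """Fraction of the query's n-grams that appear verbatim in the profile."""
    query_grams = ngrams(query.lower().split(), n)
    if not query_grams:  # query shorter than n tokens
        return 0.0
    profile_grams = ngrams(profile.lower().split(), n)
    return len(query_grams & profile_grams) / len(query_grams)


def shaped_reward(judge_score, query, profile, floor=0.0, max_copy=0.5, n=3):
    """Override the judge score with a fixed floor for verbatim copies.

    The rule is deterministic, so the policy cannot talk its way past it
    the way it might exploit a flaw in an LLM judge's rubric.
    """
    if copy_ratio(query, profile, n) > max_copy:
        return floor
    return judge_score


# Hypothetical profile and queries.
profile = "senior backend engineer at acme corp building payment systems in go"
copied = "senior backend engineer at acme corp"
portable = "backend engineer payment systems go"

print(shaped_reward(0.9, copied, profile))    # copied span is floored
print(shaped_reward(0.8, portable, profile))  # judge score passes through
```

A hard override is the simplest design choice; a softer penalty that scales with the copy ratio would be another option under the same assumptions.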