Designing Reward Signals for Portable Query Generation: A Case Study in Industrial Semantic Job Search

📄 Abstract

Job-search platforms rely on low-bandwidth query interfaces that often fail to capture the high-dimensional complexity of candidate profiles. We present an end-to-end RLAIF (Reinforcement Learning from AI Feedback) framework for generating portable job search queries: terms that abstract away seeker-specific identifiers while preserving generalizable qualifications. The task has a highly adversarial reward surface, in which policy optimization frequently exploits flaws in LLM-as-judge rubrics and collapses into degenerate verbatim copying. We ran empirical experiments to separate the effect of optimization mechanics from that of structured reward engineering. Our results show that, for critic-free optimizers, performance is determined overwhelmingly by robust reward shaping, and the specific choice of algorithm is largely immaterial. The critic-free per-rollout baseline methods (RLOO and REINFORCE++) natively resist reward hacking, whereas the group-relative advantage normalization in GRPO appears uniquely sensitive to spurious reward signals, making it disproportionately susceptible to exploitation. Introducing a deterministic, rule-based reward floor that corrects the rewards assigned to verbatim copying mitigates this failure mode and yields a $+0.147$ quality improvement on a cross-family evaluation judge. Finally, we show that the training-time reward model inflates performance gains by $2.4\times$, confirming that training success depends fundamentally on enforcing reward-shaping discipline rather than on selecting alternative optimizers.
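To make the contrast between group-relative normalization and a per-rollout baseline concrete, here is a minimal sketch in Python. It is not taken from the paper: the function names, the use of the population standard deviation, the `eps` constant, and the example rewards are all assumptions made for illustration. It shows one possible intuition for the sensitivity described above: dividing by the group's standard deviation rescales even a tiny reward difference to unit size, while a leave-one-out baseline keeps it at its original magnitude.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages (assumed form): centre each reward on the
    group mean, then divide by the group standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; real implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


def rloo_advantages(rewards):
    """Leave-one-out advantages: each rollout's baseline is the mean reward
    of the other rollouts sampled for the same prompt."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# Hypothetical judge scores for four rollouts of one prompt. The last rollout
# gets a small, possibly spurious bump of 0.02 from the judge.
rewards = [0.50, 0.50, 0.50, 0.52]

grpo = grpo_advantages(rewards)
rloo = rloo_advantages(rewards)

print("GRPO-style:", [round(a, 3) for a in grpo])
print("RLOO-style:", [round(a, 3) for a in rloo])
```

In this toy group the bumped rollout receives a normalized advantage well above 1 under the group-relative form, but only about 0.02 under the leave-one-out form. Whether this mechanism accounts for the behaviour reported in the abstract is not something the sketch can establish.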

1️⃣ One-sentence summary

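This paper proposes a reinforcement learning framework based on AI feedback that, for industrial job search, automatically generates search keywords that mask job-seeker identity information while retaining general qualifications. It introduces a rule-based reward floor to keep the AI reward model from being exploited (for example, by copying the source text verbatim), which significantly improves query quality.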
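The source does not say how the rule-based reward floor is defined, so the following Python sketch is only one hypothetical reading. It assumes that a query counts as a verbatim copy when it shares a long contiguous run of tokens with the profile, and that such a query has its judge score overridden with a fixed low value. The threshold `COPY_RUN_THRESHOLD`, the constant `COPY_REWARD`, and the function names are invented for illustration.

```python
from difflib import SequenceMatcher

COPY_RUN_THRESHOLD = 5  # hypothetical: copied runs this long count as verbatim
COPY_REWARD = 0.0       # hypothetical: fixed reward given to copied queries


def longest_copied_run(query: str, profile: str) -> int:
    """Length, in tokens, of the longest contiguous run shared by both texts."""
    q = query.lower().split()
    p = profile.lower().split()
    matcher = SequenceMatcher(None, q, p, autojunk=False)
    return matcher.find_longest_match(0, len(q), 0, len(p)).size


def shaped_reward(judge_score: float, query: str, profile: str) -> float:
    """Deterministic override: the judge score is ignored for copied queries."""
    if longest_copied_run(query, profile) >= COPY_RUN_THRESHOLD:
        return min(judge_score, COPY_REWARD)
    return judge_score


profile = ("Senior backend engineer at Acme Corp in Berlin "
           "with Python and Kubernetes experience")

# A query that lifts the profile text gets the fixed low reward even though
# the (hypothetical) judge scored it highly.
copied = shaped_reward(0.9, "senior backend engineer at acme corp in berlin", profile)

# A query that keeps only generalizable terms keeps its judge score.
portable = shaped_reward(0.8, "backend engineer python kubernetes", profile)

print(copied, portable)
```

Because the check is a deterministic string rule, it does not depend on the judge's rubric and cannot be gamed the way a learned score can. The cost is that it only catches the specific copying pattern it encodes.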
