arXiv submission date: 2026-05-27
📄 Abstract - Refining Multidimensional Video Reward Models via Disentangled Influence Functions

As Text-to-Video (T2V) generation models continue to evolve, the complexity of video evaluation necessitates a fine-grained assessment across various axes. To address this, recent works have focused on developing Multidimensional Video Reward Models (MVRMs), which decompose the evaluation process to better align with the multifaceted nature of human visual perception. However, training effective MVRMs is fundamentally challenged by the complex nature of video data. In this work, we identify a critical phenomenon termed Dimensional Heterogeneity: the reliability of a training sample can vary substantially across evaluation dimensions, meaning that a sample may provide reliable supervision for one objective while inducing high supervision risk for another. Consequently, prevailing data-centric methods that filter based on global scalar metrics are ill-posed for T2V tasks. To address this, we propose a disentangled influence framework that efficiently estimates dimension-specific supervision risk. Leveraging this framework, we introduce two dimension-disentangled refinement strategies: Dimension-Disentangled Pruning, which removes extreme high-risk samples, and Dimension-Disentangled Reweighting, which softly down-weights high-risk supervision. Extensive experiments demonstrate that our disentangled strategies significantly outperform global filtering baselines, yielding reward models with superior alignment to ground truth.
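
The contrast the abstract draws between global filtering and dimension-disentangled refinement can be illustrated with a small sketch. It is not the paper's implementation: the risk matrix, the function names (`prune_mask`, `soft_weights`, `global_mask`), the drop fraction, and the exponential weighting rule are all assumptions made for illustration. The abstract only says that risk is estimated per dimension with influence functions, which this sketch does not compute.

```python
import math

def prune_mask(risk, drop_frac):
    """Per-dimension keep mask (hypothetical rule): mask[i][d] is False when
    sample i is among the top drop_frac highest-risk samples for dimension d."""
    n, dims = len(risk), len(risk[0])
    n_drop = int(n * drop_frac)
    mask = [[True] * dims for _ in range(n)]
    for d in range(dims):
        order = sorted(range(n), key=lambda i: risk[i][d], reverse=True)
        for i in order[:n_drop]:
            mask[i][d] = False
    return mask

def soft_weights(risk, temperature=1.0):
    """Per-dimension loss weights (hypothetical rule): higher risk gives a
    smaller weight; non-positive risk keeps the full weight of 1.0."""
    return [[math.exp(-max(r, 0.0) / temperature) for r in row] for row in risk]

def global_mask(risk, drop_frac):
    """Global baseline: collapse risk to one scalar per sample (the mean) and
    drop whole samples, discarding their supervision on every dimension."""
    n = len(risk)
    n_drop = int(n * drop_frac)
    mean = [sum(row) / len(row) for row in risk]
    order = sorted(range(n), key=lambda i: mean[i], reverse=True)
    dropped = set(order[:n_drop])
    return [i not in dropped for i in range(n)]

# Made-up risk scores: rows are training samples, columns are evaluation dimensions.
risk = [
    [0.1, 2.0, 0.0],
    [1.5, 0.1, 0.2],
    [0.2, 0.3, 0.1],
    [0.0, 0.2, 1.8],
]

per_dim = prune_mask(risk, drop_frac=0.25)
weights = soft_weights(risk)
whole_sample = global_mask(risk, drop_frac=0.25)
```

With these made-up scores, the global baseline discards sample 0 entirely, while the per-dimension mask drops only its second dimension and keeps its low-risk supervision on the other two. That is the behavior the abstract attributes to dimensional heterogeneity.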

Top-level tags: multi-modal, model evaluation, machine learning
Detailed tags: text-to-video, reward model, disentangled influence, supervision risk, dimensional heterogeneity

Refining Multidimensional Video Reward Models via Disentangled Influence Functions


1️⃣ One-sentence summary

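This paper proposes a new framework that identifies how the reliability of a training sample differs across evaluation dimensions (termed "dimensional heterogeneity") and uses disentangled influence functions to handle the risk in each dimension separately, refining video reward models more precisely so that their video quality judgments align more closely with human subjective perception.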

Source: arXiv: 2605.28203