Face-to-Face: A Video Dataset for Multi-Person Interaction Modeling
1️⃣ One-sentence summary
This paper releases a new video dataset called F2F-JF, built specifically for studying interaction and response timing in two-person conversations, and uses a digital-host generation task to show how the dataset helps AI models better understand and simulate the dynamic responses in interpersonal communication.
Modeling the reactive tempo of human conversation remains difficult because most audio-visual datasets portray isolated speakers delivering short monologues. We introduce Face-to-Face with Jimmy Fallon (F2F-JF), a 70-hour, 14k-clip dataset of two-person talk-show exchanges that preserves the sequential dependency between a guest turn and the host's response. A semi-automatic pipeline combines multi-person tracking, speech diarization, and lightweight human verification to extract temporally aligned host/guest tracks with tight crops and metadata that are ready for downstream modeling. We showcase the dataset with a reactive, speech-driven digital avatar task in which the host video during $[t_1,t_2]$ is generated from their audio plus the guest's preceding video during $[t_0,t_1]$. Conditioning a MultiTalk-style diffusion model on this cross-person visual context yields small but consistent Emotion-FID and FVD gains while preserving lip-sync quality relative to an audio-only baseline. The dataset, preprocessing recipe, and baseline together provide an end-to-end blueprint for studying dyadic, sequential behavior, which we expand upon throughout the paper. Dataset and code will be made publicly available.
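The abstract's conditioning setup pairs a guest's preceding video window $[t_0,t_1]$ with the host's audio and target video over $[t_1,t_2]$. A minimal sketch of how such training triples might be sliced from turn boundaries is shown below; this is an illustrative assumption, not the paper's actual API (names like `DyadicSample` and `make_samples` are hypothetical), and it assumes turn boundaries alternate guest-turn start, host-turn start, and so on.

```python
from dataclasses import dataclass

@dataclass
class DyadicSample:
    """One guest->host training triple (hypothetical structure)."""
    guest_video: tuple  # (t0, t1): guest context window, in seconds
    host_audio: tuple   # (t1, t2): audio driving the generation
    host_video: tuple   # (t1, t2): host video to generate (target)

def make_samples(turn_boundaries):
    """Slice alternating turn boundaries [guest_start, host_start,
    next_guest_start, ...] into (guest context, host audio, host target)
    triples, one per consecutive guest->host turn pair."""
    samples = []
    for i in range(0, len(turn_boundaries) - 2, 2):
        t0, t1, t2 = turn_boundaries[i:i + 3]
        samples.append(DyadicSample(
            guest_video=(t0, t1),
            host_audio=(t1, t2),
            host_video=(t1, t2),
        ))
    return samples

# Example: turns at 0s (guest), 4s (host), 9s yield one sample whose
# guest context spans [0, 4] and whose host target spans [4, 9].
samples = make_samples([0.0, 4.0, 9.0])
```

The key property this preserves, per the abstract, is the sequential dependency: the host target window never overlaps the guest context that conditions it.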
Source: arXiv: 2603.14794