Face-to-Face: A Video Dataset for Multi-Person Interaction Modeling
1️⃣ One-sentence summary
This paper releases a new video dataset called F2F-JF, built specifically for studying interaction and response timing in two-person conversations, and uses a digital-host generation task to show how the dataset helps AI models better understand and simulate the dynamic responses in interpersonal communication.
Modeling the reactive tempo of human conversation remains difficult because most audio-visual datasets portray isolated speakers delivering short monologues. We introduce Face-to-Face with Jimmy Fallon (F2F-JF), a 70-hour, 14k-clip dataset of two-person talk-show exchanges that preserves the sequential dependency between a guest turn and the host's response. A semi-automatic pipeline combines multi-person tracking, speech diarization, and lightweight human verification to extract temporally aligned host/guest tracks with tight crops and metadata that are ready for downstream modeling. We showcase the dataset with a reactive, speech-driven digital avatar task in which the host video during $[t_1,t_2]$ is generated from their audio plus the guest's preceding video during $[t_0,t_1]$. Conditioning a MultiTalk-style diffusion model on this cross-person visual context yields small but consistent Emotion-FID and FVD gains while preserving lip-sync quality relative to an audio-only baseline. The dataset, preprocessing recipe, and baseline together provide an end-to-end blueprint for studying dyadic, sequential behavior, which we expand upon throughout the paper. Dataset and code will be made publicly available.
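The abstract's conditioning setup pairs a guest's preceding video window $[t_0,t_1]$ with the host's audio and target video over $[t_1,t_2]$. A minimal sketch of how such training triples might be sliced from turn boundaries is shown below; this is an illustrative assumption, not the paper's actual API (names like `DyadicSample` and `make_samples` are hypothetical), and it assumes turn boundaries alternate guest-turn start, host-turn start, and so on.

```python
from dataclasses import dataclass

@dataclass
class DyadicSample:
    """One guest->host training triple (hypothetical structure)."""
    guest_video: tuple  # (t0, t1): guest context window, in seconds
    host_audio: tuple   # (t1, t2): audio driving the generation
    host_video: tuple   # (t1, t2): host video to generate (target)

def make_samples(turn_boundaries):
    """Slice alternating turn boundaries [guest_start, host_start,
    next_guest_start, ...] into (guest context, host audio, host target)
    triples, one per consecutive guest->host turn pair."""
    samples = []
    for i in range(0, len(turn_boundaries) - 2, 2):
        t0, t1, t2 = turn_boundaries[i:i + 3]
        samples.append(DyadicSample(
            guest_video=(t0, t1),
            host_audio=(t1, t2),
            host_video=(t1, t2),
        ))
    return samples

# Example: turns at 0s (guest), 4s (host), 9s yield one sample whose
# guest context spans [0, 4] and whose host target spans [4, 9].
samples = make_samples([0.0, 4.0, 9.0])
```

The key property this preserves, per the abstract, is the sequential dependency: the host target window never overlaps the guest context that conditions it.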
Source: arXiv: 2603.14794