RoadTones: Tone Controllable Text Generation from Road Event Videos

📄 Abstract - RoadTones: Tone Controllable Text Generation from Road Event Videos

Existing video-language models can generate factual descriptions of road events but offer no control over how those events are expressed: their tone, urgency, or style. This limits deployment in communication-critical settings, where the effectiveness of a message depends on both content and presentation, not just factual accuracy. To address this gap, we introduce a comprehensive dataset-model-evaluation suite for tone-controllable road video captioning. Our human-validated data generation pipeline expands road-video corpora with diverse tonal annotations and multi-tone captions, yielding the RoadTones-51K dataset. We propose RoadTones-VL-CoT, a controllable video-to-text model that also generates tone-conditioned Chain-of-Thought intermediate drafts for interpretability. We also introduce RoadTones-Eval, a new evaluation suite that jointly measures factual consistency and tone adherence. In addition, we conduct a user study whose results validate caption quality, tone control, and factual consistency. Together, these contributions lay the foundation for context-sensitive, tone-controllable video captioning.


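1️⃣ One-Sentence Summary

This paper presents a complete solution, comprising a dataset, a model, and an evaluation method, that enables AI to generate text descriptions of road event videos in an adjustable tone, such as "urgent" or "neutral", so that the descriptions are not only accurate but can also adapt their expression to the communication need.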
