Abstract - Domain Generalizable Adaptation of 3D Vision-Language Models via Regularized Fine-Tuning
Domain adaptation remains a central challenge in 3D vision, especially for multimodal foundation models that align 3D point clouds with visual and textual data. While these models demonstrate strong general capabilities, adapting them to downstream domains with limited data often leads to overfitting and catastrophic forgetting. To address this, we introduce ReFine3D, a regularized fine-tuning framework designed for domain-generalizable tuning of 3D large multimodal models (LMMs). ReFine3D combines selective layer tuning with two targeted regularization strategies: multi-view consistency across augmented point clouds and text diversity through synonym-based prompts generated by large language models. Additionally, we incorporate point-rendered vision supervision and a test-time augmentation mechanism with confidence-based aggregation to further enhance robustness. Extensive experiments across different 3D domain generalization benchmarks show that ReFine3D improves base-to-novel class generalization by 1.36%, cross-dataset transfer by 2.43%, robustness to corruption by 1.80%, and few-shot accuracy by up to 3.11%, outperforming prior state-of-the-art methods with minimal added computational overhead.
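The abstract mentions test-time augmentation with confidence-based aggregation but does not give the exact rule. The following is a minimal sketch under stated assumptions, not the authors' implementation: each augmented view's confidence is taken to be its maximum softmax probability, and the per-view class probabilities are averaged with weights proportional to that confidence. The function names `aggregate_views` and `predict` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_views(view_logits):
    """Combine per-view class logits into one probability vector.

    Assumption (not from the paper): a view's confidence is its maximum
    softmax probability, and views are averaged with weights proportional
    to that confidence.
    """
    probs = [softmax(logits) for logits in view_logits]
    conf = [max(p) for p in probs]
    total_conf = sum(conf)
    num_classes = len(probs[0])
    return [
        sum(c * p[k] for c, p in zip(conf, probs)) / total_conf
        for k in range(num_classes)
    ]


def predict(view_logits):
    """Return the index of the class with the highest aggregated probability."""
    agg = aggregate_views(view_logits)
    return max(range(len(agg)), key=agg.__getitem__)


# Three augmented views of one point cloud, three classes.
# One view is very confident in class 0; two views weakly prefer class 1.
views = [[4.0, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.4, 0.0]]
prediction = predict(views)
```

With these inputs a plain majority vote over views would pick class 1, while the confidence-weighted average favors the single confident view. Whether that behavior is desirable depends on how well calibrated the model's confidences are.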
Domain Generalizable Adaptation of 3D Vision-Language Models via Regularized Fine-Tuning
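1️⃣ One-Sentence Summary
This paper proposes ReFine3D, a fine-tuning framework that selectively tunes model layers and adds two regularization strategies: enforcing consistency across differently augmented views of a point cloud, and increasing text diversity through synonym substitution. Together these let 3D multimodal models adapt to new tasks with limited data without forgetting their original capabilities, and they improve generalization to novel classes, new datasets, and corrupted inputs across multiple benchmarks.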
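The multi-view consistency regularizer mentioned in the summary can be illustrated with a small sketch. The source does not state the form of the loss, so this example assumes a symmetric KL divergence between the class distributions predicted for two augmented views of the same point cloud, added to the task loss with a hypothetical weight `weight`. It is an illustration of the general idea, not the paper's formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_loss(logits_a, logits_b):
    """Symmetric KL between predictions for two augmented views.

    Assumption: the regularizer compares predicted class distributions;
    the paper may instead compare features or use a different distance.
    """
    p, q = softmax(logits_a), softmax(logits_b)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def total_loss(task_loss, logits_a, logits_b, weight=1.0):
    """Task loss plus the weighted consistency term (weight is hypothetical)."""
    return task_loss + weight * consistency_loss(logits_a, logits_b)


# Logits for two augmentations of the same point cloud.
view_a = [2.0, 0.5, -1.0]
view_b = [1.5, 0.7, -0.8]
loss = total_loss(task_loss=0.5, logits_a=view_a, logits_b=view_b, weight=0.1)
```

The term is zero when both views yield the same prediction and grows as they diverge, which penalizes a fine-tuned model for being sensitive to the augmentation.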