arXiv submission date: 2026-03-09
📄 Abstract - Using Vision Language Foundation Models to Generate Plant Simulation Configurations via In-Context Learning

This paper introduces a synthetic benchmark to evaluate the performance of vision language models (VLMs) in generating plant simulation configurations for digital twins. While functional-structural plant models (FSPMs) are useful tools for simulating biophysical processes in agricultural environments, their high complexity and low throughput create bottlenecks for deployment at scale. We propose a novel approach that leverages two state-of-the-art open-source VLMs, Gemma 3 and Qwen3-VL, to directly generate simulation parameters in JSON format from drone-based remote sensing images. Using a synthetic cowpea plot dataset generated via the Helios 3D procedural plant generation library, we tested five in-context learning methods and evaluated the models across three categories: JSON integrity, geometric evaluations, and biophysical evaluations. Our results show that while VLMs can interpret structural metadata and estimate parameters like plant count and sun azimuth, they often exhibit performance degradation due to contextual bias or rely on dataset means when visual cues are insufficient. Validation on a real-world drone orthophoto dataset and an ablation study using a blind baseline further characterize the models' reasoning capabilities versus their reliance on contextual priors. To the best of our knowledge, this is the first study to utilize VLMs to generate structured JSON configurations for plant simulations, providing a scalable framework for reconstructing 3D plots for digital twins in agriculture.
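
As a rough illustration of what a "JSON integrity" check on model output could look like, the following minimal Python sketch parses a model's raw text and verifies that expected fields are present with plausible types. The field names `plant_count` and `sun_azimuth_deg` and the schema itself are hypothetical, chosen only because the abstract mentions plant count and sun azimuth; the paper's actual configuration format and metrics may differ.

```python
import json

# Hypothetical schema: field names and types are illustrative assumptions,
# not the paper's actual configuration format.
REQUIRED_FIELDS = {
    "plant_count": int,
    "sun_azimuth_deg": (int, float),
}


def check_json_integrity(raw_output: str) -> tuple[bool, list[str]]:
    """Return (is_valid, problems) for a model's raw text output."""
    try:
        config = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        # The output is not syntactically valid JSON at all.
        return False, [f"not parseable JSON: {exc.msg}"]

    if not isinstance(config, dict):
        return False, ["top-level value is not a JSON object"]

    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in config:
            problems.append(f"missing field: {name}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(config[name], bool) or not isinstance(config[name], expected):
            problems.append(f"wrong type for field: {name}")
    return not problems, problems
```

A check of this kind only establishes that the output is well-formed; whether the parameter values are geometrically or biophysically accurate would need separate evaluation against ground truth.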

Top-level tags: computer vision, natural language processing, multi-modal
Detailed tags: vision language models, digital twins, agriculture, in-context learning, 3D reconstruction

Using Vision Language Foundation Models to Generate Plant Simulation Configurations via In-Context Learning


1️⃣ One-sentence summary

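This study is the first to propose using advanced vision language models to generate structured parameters for 3D plant simulations directly from drone remote sensing images, offering a scalable new approach for agricultural digital twins; however, model performance is affected by contextual bias and insufficient visual cues.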

Source: arXiv: 2603.08930