Staged Executable Inverse Graphics (SEIG) is a new agentic framework for reconstructing editable 3D scenes from a single image. Unlike traditional methods, SEIG uses pretrained vision-language models (VLMs) to generate executable Blender programs that explicitly define the scene factors: geometry, materials, composition, and lighting. Rather than producing the whole scene in one pass, the framework refines these factors progressively, stage by stage, within the Blender code space. Evaluated on diverse scenes with pixel-level, perceptual, and semantic reconstruction metrics, the staged approach significantly improves reconstruction fidelity. The result highlights the value of task decomposition when general-purpose VLMs are used for executable inverse graphics. The researchers also demonstrate several downstream applications made possible by the reconstructed, editable Blender scenes.
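To make the idea of staged refinement concrete, here is a minimal Python sketch of a loop that revises a Blender program one scene factor at a time. It is an illustration under stated assumptions, not the authors' implementation: the stage order simply follows the factors as listed above, and `Proposer`, `run_stages`, and `stub_propose` are hypothetical names. The stub stands in for a real VLM call, and the Blender program is handled only as text, so the sketch runs without Blender installed.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

# Stage order mirrors the scene factors named in the summary.
# Assumption: the order SEIG actually uses may differ.
STAGES = ("geometry", "materials", "composition", "lighting")

# Hypothetical interface: (stage name, current program text) -> revised program text.
Proposer = Callable[[str, str], str]


@dataclass
class StagedResult:
    program: str                                       # final Blender script, as text
    history: List[str] = field(default_factory=list)   # stages applied, in order


def run_stages(propose: Proposer, stages: Sequence[str] = STAGES) -> StagedResult:
    """Refine one program string stage by stage instead of in a single pass."""
    program = "import bpy\n"  # seed script; never executed in this sketch
    history: List[str] = []
    for stage in stages:
        # Each stage sees the program produced by all earlier stages.
        program = propose(stage, program)
        history.append(stage)
    return StagedResult(program=program, history=history)


def stub_propose(stage: str, program: str) -> str:
    """Stand-in for a VLM call: appends a placeholder comment for the stage."""
    return program + f"# TODO ({stage}): code proposed by the VLM would go here\n"


if __name__ == "__main__":
    print(run_stages(stub_propose).program)
```

In a real system, `stub_propose` would be replaced by a call that sends the input image, the current script, and possibly a render of it to a VLM, and the returned script would be executed inside Blender. Those details are not specified in this summary and are left out here.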
Featured on AI Radar: Staged Executable Inverse Graphics (SEIG) with Vision-Language Models in Blender