Why it matters
This research explores a novel approach to inverse graphics, enabling the reconstruction of editable 3D scenes from single images using general-purpose VLMs and Blender. By generating executable Blender code, it offers a pathway for more accessible 3D content creation and manipulation without specialized 2D/3D foundation models or complex rendering techniques.

A new agentic framework, Staged Executable Inverse Graphics (SEIG), reconstructs editable 3D scenes from single images. Rather than relying on specialized 2D/3D foundation models, SEIG uses pretrained vision-language models (VLMs) to generate executable Blender programs that directly define scene factors such as geometry, materials, composition, and lighting, and it refines these factors progressively within the Blender code space. Evaluations across diverse scenes, using pixel-level, perceptual, and semantic fidelity metrics, indicate that the staged approach significantly improves reconstruction fidelity, which points to the value of task decomposition when using general-purpose VLMs for executable inverse graphics. The researchers also demonstrate several downstream applications enabled by the reconstructed, editable Blender scenes.
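To make the idea of staged refinement concrete, here is a minimal Python sketch of a loop that works on one scene factor at a time and keeps a candidate snippet only when it raises a fidelity score. It is an illustration under stated assumptions, not SEIG's actual implementation: the stage order, the accept-if-better rule, and the names `staged_refine`, `propose`, `score`, and `assemble` are all hypothetical. In a real system, `propose` would call a VLM and `score` would render the script in Blender and compare the result with the input image; here both are replaced by simple stand-ins so the sketch runs on its own.

```python
from typing import Callable, Dict, List

# Stage names follow the scene factors named in the summary.
# The ordering is an assumption made for this sketch.
STAGES = ["geometry", "materials", "composition", "lighting"]

# Maps a stage name to the Blender Python snippet chosen for that factor.
Program = Dict[str, str]


def assemble(program: Program) -> str:
    """Join the per-stage snippets into one script, in stage order."""
    return "\n".join(program[s] for s in STAGES if s in program)


def staged_refine(
    propose: Callable[[str, Program], List[str]],
    score: Callable[[str], float],
    rounds_per_stage: int = 2,
) -> Program:
    """Refine one factor at a time; keep a candidate only if the score rises."""
    program: Program = {}
    best = float("-inf")
    for stage in STAGES:
        for _ in range(rounds_per_stage):
            for candidate in propose(stage, program):
                trial = {**program, stage: candidate}
                trial_score = score(assemble(trial))
                if trial_score > best:
                    program, best = trial, trial_score
    return program


def fake_propose(stage: str, program: Program) -> List[str]:
    # Hypothetical stand-in for a VLM call: two fixed candidates per stage.
    return [f"# {stage} v1", f"# {stage} v2"]


def fake_score(script: str) -> float:
    # Hypothetical stand-in for render-and-compare: prefers "v2" snippets.
    return script.count("v2") + 0.1 * script.count("v1")


result = staged_refine(fake_propose, fake_score)
print(assemble(result))
```

With these stand-ins, the loop settles on the "v2" candidate for every stage. Splitting the program by factor means a poor lighting proposal cannot overwrite geometry that already scored well, which is one plausible reading of why decomposition helps.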

Article ID: cmpweuh6n0
Featured on AI Radar: Staged Executable Inverse Graphics (SEIG) with Vision-Language Models in Blender