Staged Executable Inverse Graphics (SEIG) with Vision-Language Models in Blender

Researchers have introduced Staged Executable Inverse Graphics (SEIG), an agentic framework that uses pretrained vision-language models (VLMs) to reconstruct a 3D scene from a single image as an editable Blender program. The method progressively refines scene factors such as geometry, materials, composition, and lighting in Blender code, and the reported evaluations indicate that this task decomposition improves reconstruction fidelity.

RDR 72 | Confidence 85% | Tags: inverse graphics, 3D reconstruction, Blender, vision-language models, agentic AI

Why it matters

This research explores a novel approach to inverse graphics, enabling the reconstruction of editable 3D scenes from single images using general-purpose VLMs and Blender. By generating executable Blender code, it offers a pathway for more accessible 3D content creation and manipulation without specialized 2D/3D foundation models or complex rendering techniques.

SEIG targets the problem of reconstructing an editable 3D scene from a single image. Rather than relying on specialized 2D/3D foundation models, it uses pretrained vision-language models (VLMs) to generate executable Blender programs that directly define scene factors such as geometry, materials, composition, and lighting. The framework then refines these factors progressively, stage by stage, within the Blender code space. The researchers evaluate reconstructions across diverse scenes with metrics covering pixel-level, perceptual, and semantic fidelity, and report that the staged approach improves fidelity. They attribute the gain to task decomposition: a general-purpose VLM handles executable inverse graphics better when the problem is split into stages. The researchers also demonstrate several downstream applications made possible by the reconstructed, editable Blender scenes.
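To make the idea of staged refinement in code space concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the stage order, the accept-if-better rule, the mean-squared-error score, and every name in it (`staged_reconstruct`, `toy_propose`, `toy_render`) are hypothetical. In a real pipeline, `propose` would presumably call a VLM and `render` would execute the program in Blender; here both are replaced by toy stand-ins so the loop runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Assumption: the source lists these four scene factors but does not say
# in which order SEIG refines them. This order is illustrative only.
STAGES = ("geometry", "materials", "composition", "lighting")

Image = List[float]  # flattened pixel values, a stand-in for a rendered image


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Pixel-level error between two equally sized images."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


@dataclass
class StageLog:
    stage: str
    error_before: float
    error_after: float
    accepted: bool


def staged_reconstruct(
    target: Image,
    initial_program: str,
    propose: Callable[[str, str, Image], str],
    render: Callable[[str], Image],
    stages: Sequence[str] = STAGES,
    rounds_per_stage: int = 2,
) -> Tuple[str, List[StageLog]]:
    """Refine a scene program one stage at a time.

    A candidate program is kept only if its render is strictly closer
    to the target than the current best (a hypothetical acceptance rule).
    """
    program = initial_program
    best_error = mse(render(program), target)
    log: List[StageLog] = []
    for stage in stages:
        for _ in range(rounds_per_stage):
            candidate = propose(stage, program, target)
            candidate_error = mse(render(candidate), target)
            accepted = candidate_error < best_error
            log.append(StageLog(stage, best_error, candidate_error, accepted))
            if accepted:
                program, best_error = candidate, candidate_error
    return program, log


def toy_render(program: str) -> Image:
    """Toy renderer: each 'stage = value' line sets one pixel."""
    values = dict.fromkeys(STAGES, 0.0)
    for line in program.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            values[key.strip()] = float(value)
    return [values[stage] for stage in STAGES]


def toy_propose(stage: str, program: str, target: Image) -> str:
    """Toy stand-in for a VLM: rewrites only the line owned by `stage`."""
    kept = [ln for ln in program.splitlines() if not ln.startswith(stage)]
    kept.append(f"{stage} = {target[STAGES.index(stage)]}")
    return "\n".join(kept)


if __name__ == "__main__":
    final_program, history = staged_reconstruct(
        [0.7, 0.2, 0.9, 0.4], "", toy_propose, toy_render
    )
    print(final_program)
    for entry in history:
        print(entry)
```

The design choice worth noting is that each stage edits only its own part of the program while the score is always computed on the full render, so a later stage cannot silently undo an earlier one without the error going up and the candidate being rejected.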

Article ID: cmpweuh6n0 | Featured on AI Radar