NEO-ov: A Native One-Vision Foundation Model for End-to-End Spatiotemporal Modeling

Researchers have introduced NEO-ov, a native foundation model designed to learn cross-frame and pixel-word correspondence end-to-end. Unlike traditional vision-language models (VLMs) that combine separate encoders and decoders, NEO-ov eliminates module boundaries to enable unified spatiotemporal modeling, aiming to improve fine-grained visual perception across multiple images and videos.


Why it matters

NEO-ov represents a step towards more integrated vision-language models by removing the modular fragmentation common in current VLMs. This approach could lead to more efficient and accurate processing of visual information, particularly in complex scenarios involving multiple images, video understanding, and spatial intelligence, potentially influencing future multimodal AI development.

Current vision-language models (VLMs) typically rely on a modular architecture that stitches together distinct image encoders and language decoders. This multi-stage alignment can fragment pixel-level signals and scatter early pixel-word interactions. Native VLMs have shown promise on single images, but they have not been extensively explored for multi-image understanding, video understanding, or spatial intelligence tasks.

To address these limitations, researchers have developed NEO-ov, a native foundation model that learns cross-frame and pixel-word correspondence end-to-end. The model operates without external encoders, auxiliary adapters, or post-hoc fusion, which eliminates module boundaries. With this design, fine-grained and unified spatiotemporal modeling can emerge natively within the model architecture.
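To make the "no module boundaries" idea concrete, here is a minimal toy sketch in plain Python. It is not NEO-ov's implementation: the function names, dimensions, random weights, and the identity-projection attention are all assumptions chosen for illustration, and positional information is omitted. It only shows the general pattern of a native design, where patches from several frames and word tokens are placed in one sequence and attended over jointly, rather than passing through a separate vision encoder first.

```python
import math
import random

def patchify(frame, patch):
    """Split an H x W grayscale frame (list of rows) into flattened patch vectors."""
    h, w = len(frame), len(frame[0])
    return [
        [frame[r + i][c + j] for i in range(patch) for j in range(patch)]
        for r in range(0, h, patch)
        for c in range(0, w, patch)
    ]

def linear(vec, weight):
    """Project a vector with a (len(vec) x d_out) weight matrix."""
    return [sum(v * weight[i][j] for i, v in enumerate(vec))
            for j in range(len(weight[0]))]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head attention with identity Q/K/V projections (illustration only)."""
    d = len(tokens[0])
    out, weights = [], []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        out.append([sum(w[t] * tokens[t][j] for t in range(len(tokens)))
                    for j in range(d)])
    return out, weights

def build_sequence(frames, word_ids, patch, patch_proj, word_table):
    """Hypothetical helper: one shared sequence of frame patches, then words."""
    seq = []
    for frame in frames:
        seq.extend(linear(p, patch_proj) for p in patchify(frame, patch))
    seq.extend(word_table[w] for w in word_ids)
    return seq

# Toy setup (all values are assumptions): two 4x4 frames, 2x2 patches, 3 words.
rng = random.Random(0)
D, PATCH, VOCAB = 8, 2, 10
patch_proj = [[rng.gauss(0, 0.1) for _ in range(D)] for _ in range(PATCH * PATCH)]
word_table = [[rng.gauss(0, 0.1) for _ in range(D)] for _ in range(VOCAB)]
frames = [[[rng.random() for _ in range(4)] for _ in range(4)] for _ in range(2)]
word_ids = [3, 7, 1]

seq = build_sequence(frames, word_ids, PATCH, patch_proj, word_table)
out, weights = self_attention(seq)
print(len(seq))  # 11 tokens: 2 frames x 4 patches + 3 words
```

Because every token attends over the same sequence, each row of `weights` contains entries for patches in both frames and for the words, which is the toy analogue of cross-frame and pixel-word correspondence being learned in one place. A modular pipeline would instead compute image features in a separate encoder and align them to the language model afterwards.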

Empirical results indicate that NEO-ov significantly narrows the performance gap with modular counterparts and demonstrates strong capabilities in fine-grained visual perception. This suggests that native "one-vision" architectures are not only feasible but can be competitive at scale. The research also provides architectural analyses and detailed training recipes to support further development in native multimodal modeling. The code and models for NEO-ov are publicly available on GitHub.

Article ID: cmpp12dfg0. Featured on AI Radar.