Current vision-language models (VLMs) typically rely on a modular architecture that stitches together separate image encoders and language decoders. This multi-stage alignment can fragment pixel-level signals and scatter early pixel-word interactions. In contrast, native VLMs show promise on single images but have not been extensively explored for multi-image understanding, video understanding, or spatial intelligence tasks.
To address these limitations, researchers have developed NEO-ov, a native foundation model that learns cross-frame and pixel-word correspondence end-to-end. The model operates without external encoders, auxiliary adapters, or post-hoc fusion, which eliminates module boundaries. As a result, fine-grained and unified spatiotemporal modeling can emerge natively within a single architecture.
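To make the idea of "no module boundaries" concrete, the following is a toy sketch, not NEO-ov's actual implementation. It assumes a generic patch-embedding scheme: raw patches from several frames and text token embeddings are placed in one sequence, and a single self-attention layer attends across all of them, so cross-frame and pixel-word interactions happen in the same operation. All names (`patchify`, `build_joint_sequence`, `joint_self_attention`), the dimensions, the random weights, and the unmasked single-head attention are hypothetical simplifications chosen for illustration.

```python
import numpy as np


def patchify(frames, patch):
    """Split frames (T, H, W, C) into flat patches (T * N, patch * patch * C)."""
    t, h, w, c = frames.shape
    assert h % patch == 0 and w % patch == 0, "frame size must be divisible by patch"
    gh, gw = h // patch, w // patch
    x = frames.reshape(t, gh, patch, gw, patch, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # (T, gh, gw, patch, patch, C)
    return x.reshape(t * gh * gw, patch * patch * c)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def build_joint_sequence(frames, token_ids, patch_proj, token_table, patch):
    """Embed pixels and words into one sequence, with no separate image encoder."""
    pixel_tokens = patchify(frames, patch) @ patch_proj  # (T * N, d_model)
    word_tokens = token_table[token_ids]                 # (L, d_model)
    return np.concatenate([pixel_tokens, word_tokens], axis=0)


def joint_self_attention(seq, wq, wk, wv):
    """Single-head, unmasked self-attention over the mixed pixel-word sequence."""
    q, k, v = seq @ wq, seq @ wk, seq @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Hypothetical toy sizes: 2 frames of 8x8 RGB, 4x4 patches, 3 text tokens.
rng = np.random.default_rng(0)
d_model, patch = 16, 4
frames = rng.standard_normal((2, 8, 8, 3))
token_ids = np.array([5, 1, 7])
patch_proj = rng.standard_normal((patch * patch * 3, d_model)) * 0.02
token_table = rng.standard_normal((10, d_model)) * 0.02
wq, wk, wv = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(3))

seq = build_joint_sequence(frames, token_ids, patch_proj, token_table, patch)
out, weights = joint_self_attention(seq, wq, wk, wv)
print(seq.shape, weights.shape)  # (11, 16) (11, 11)
```

In this sketch the first 8 rows of `weights` belong to pixel patches (4 per frame) and the last 3 to words, so a single attention matrix holds frame-to-frame, pixel-to-word, and word-to-pixel entries. A real native VLM would add positional encoding, masking, multiple heads, and many layers, none of which are specified here.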
Empirical results indicate that NEO-ov significantly narrows the performance gap with modular counterparts and demonstrates strong capabilities in fine-grained visual perception. This suggests that native "one-vision" architectures are not only feasible but can also be competitive at scale. The work also provides architectural analyses and detailed training recipes to support further development in native multimodal modeling. The code and models for NEO-ov are publicly available on GitHub.