Automated alternatives

Best alternatives to Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models

Live, source-backed alternatives to Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models for vision-language tasks. Alternatives are selected from the same task category and are refreshed whenever the best-of index rebuilds.

Alternatives: 7 (same task category)
Sources: 15 (distinct URLs)
Modules: 6 (indexable)
Updated: Jun 26, 2026 (from radar data)
Reference option

Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models

Recently, self-evolving large multimodal models (LMMs) have received attention for improving visual reasoning in a purely unsupervised setting. However, multi-role self-play and self-consistency reward schemes in existing self-evolving LMMs optimize answer agreement without ensuring the decoder attends to visual content, relying instead on statistical language priors to produce self-consistent outputs. This leads to a persistent failure mode we term visual under-conditioning, where the decoder relies on language priors rather than the image during generation, manifesting as insufficient attention to visual tokens. As a result, current self-evolving LMMs struggle on vision-language understanding tasks such as image captioning and visual question answering. To address this, we propose VISE (Visual Invariance Self-Evolution), a purely unsupervised self-evolving framework that directly regularizes the model's visual conditioning policy through two complementary invariance-based rewards: a geometric invariance reward that enforces spatial consistency under known transformations, and a semantic invariance reward that penalizes evidence-agnostic generation by requiring the model to recognize the absence of evidence when predicted regions are perturbed. VISE operates within a single model without specialist roles, external reward models, or annotations, and is trained on raw unlabeled images. Experiments on 18 benchmarks demonstrate the efficacy of our approach. Using Qwen3-VL-2B as the base model, VISE achieves gains of +16.85 CIDEr on COCO and +19.66 CIDEr on TextCaps, reduces object hallucination by 5.0 Chair-I points, and generalizes across four model families and scales. Our code and models are available at https://mbzuai-oryx.github.io/VISE

Research signal collected from arXiv metadata.

cs.CV, benchmark
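The abstract does not give the reward formulas, so the following is only a minimal sketch of how "spatial consistency under known transformations" could be scored, not the VISE implementation. It assumes the transformation is a horizontal flip, that the model outputs one axis-aligned box per image, and that agreement is measured with IoU. The names `hflip_box`, `iou`, and `geometric_invariance_reward` are hypothetical.

```python
from typing import Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def hflip_box(box: Box, image_width: float) -> Box:
    """Map a box through a horizontal flip of an image with the given width."""
    x_min, y_min, x_max, y_max = box
    return (image_width - x_max, y_min, image_width - x_min, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def geometric_invariance_reward(
    box_on_original: Box, box_on_flipped: Box, image_width: float
) -> float:
    """Assumed reward: IoU between the box predicted on the flipped image
    and the original prediction mapped through the same flip."""
    expected = hflip_box(box_on_original, image_width)
    return iou(expected, box_on_flipped)


if __name__ == "__main__":
    original = (10.0, 20.0, 50.0, 60.0)
    # A prediction on the flipped image that only partly matches the mapped box.
    flipped_prediction = (70.0, 20.0, 90.0, 60.0)
    print(geometric_invariance_reward(original, flipped_prediction, 100.0))
```

Under these assumptions, a prediction that moves exactly with the flip scores 1.0, and one that ignores the image and stays put scores lower, which is the behavior an invariance-style reward is meant to separate.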

RDR75 | Research-only | arxiv-ai
Alternative

NVIDIA NIM Model Catalog

Matched vision-language, vision language, multimodal; 3 source links; official inference catalog signal; access model: Free endpoint

RDR84 | Free endpoint
Alternative

Hugging Face Inference Providers

Matched vision-language, vision language, multimodal; 2 source links; official inference catalog signal; access model: Paid API

RDR81 | Paid API

| # | Alternative | Kind | Access | Fit | Why it appears | Source |
|---|---|---|---|---|---|---|
| 01 | NVIDIA NIM Model Catalog | service | Free endpoint | RDR84 | Matched vision-language, vision language, multimodal; 3 source links; official inference catalog signal; access model: Free endpoint | build.nvidia.com |
| 02 | Hugging Face Inference Providers | service | Paid API | RDR81 | Matched vision-language, vision language, multimodal; 2 source links; official inference catalog signal; access model: Paid API | huggingface.co |
| 03 | A Unified Framework for Efficient Remote Sensing Visual Question Answering: Adapting Dual, Hybrid, and Encoder-Decoder Architectures | paper | Research-only | RDR80 | Matched vision-language, vision language, vlm; 1 source link; access model: Research-only | arxiv.org |
| 04 | Fireworks AI Serverless Models | service | Paid API | RDR80 | Matched vision-language, vision language, multimodal; 2 source links; official inference catalog signal; access model: Paid API | docs.fireworks.ai |
| 05 | Together AI Serverless Models | service | Paid API | RDR80 | Matched vision-language, vision language, multimodal; 2 source links; official inference catalog signal; access model: Paid API | docs.together.ai |
| 06 | SARLO-80: Worldwide Slant SAR Language Optic Dataset 80cm | paper | Research-only | RDR75 | Matched vision-language, vision language, multimodal; 2 source links; access model: Research-only | arxiv.org |
| 07 | RSICCLLM: A Multimodal Large Language Model for Remote Sensing Image Change Captioning | paper | Research-only | RDR75 | Matched vision-language, vision language, multimodal; 2 source links; access model: Research-only; freshly updated | arxiv.org |