Automated alternatives

Best alternatives to Towards Robustness against Typographic Attack with Training-free Concept Localization

Live, source-backed alternatives to Towards Robustness against Typographic Attack with Training-free Concept Localization in the vision-language category. Alternatives are drawn from the same task category and refresh whenever the best-of index rebuilds.

Alternatives: 7 (same task category)
Sources: 15 (distinct URLs)
Modules: 6 (indexable)
Updated: Jul 2, 2026 (from radar data)
Reference option

Towards Robustness against Typographic Attack with Training-free Concept Localization

Models trained via Contrastive Language-Image Pretraining (CLIP) serve as the foundational vision encoders for most modern Large Vision Language Models (LVLMs). Despite their widespread adoption, CLIP models exhibit a critical yet underexplored failure mode: irrelevant text appearing within images confounds visual representations, biasing them toward lexical meaning rather than true visual semantics. This robustness issue, commonly described as a Typographic Attack (TA), exposes a vulnerability that poses a significant risk to safety-critical applications such as autonomous driving. To achieve interpretable and effective robustness against TA, we propose a novel, training-free mechanistic interpretability method. Our method provides sampling-based interpretations of hidden state representations and quantitatively attributes semantic versus lexical focus to individual attention heads. Through probabilistic analysis and circuit mining, we isolate specific Vision Transformer (ViT) components that disproportionately encode lexical information, thereby identifying the mechanistic source of TA. We further show that simple interventions applied directly to the identified circuits, without any additional training, can substantially improve robustness against Typographic Attacks in object classification. These interventions, such as selective adjustment of attention weights, also outperform both supervised and training-free defense methods. Our experiments demonstrate that applying the proposed intervention to the vision encoders of several state-of-the-art LVLMs yields substantial gains in Visual Question Answering accuracy under Typographic Attack interference on RIO-Bench. These results confirm both the efficacy and the generalizability of our mechanistic approach. Code is released at https://github.com/Liu-524/SamplingTAR. 
arXiv categories: cs.CV, cs.CL. Research signal collected from arXiv metadata.
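
The abstract describes interventions applied to specific attention heads without retraining. As a rough illustration of that general idea only, the sketch below scales the contribution of flagged heads before the per-head outputs are summed. It is not the authors' procedure: the helper names `combine_heads` and `scales_for`, the choice of flagged head, and the scale value `alpha` are all hypothetical, and the abstract's "selective adjustment of attention weights" may differ from this plain output scaling. See the released code for the actual method.

```python
from typing import List, Sequence

Vector = List[float]


def combine_heads(head_outputs: Sequence[Vector], head_scales: Sequence[float]) -> Vector:
    """Sum per-head contributions, multiplying each head by its scale."""
    if len(head_outputs) != len(head_scales):
        raise ValueError("one scale per head is required")
    dim = len(head_outputs[0])
    combined = [0.0] * dim
    for out, scale in zip(head_outputs, head_scales):
        for i, value in enumerate(out):
            combined[i] += scale * value
    return combined


def scales_for(num_heads: int, flagged_heads: Sequence[int], alpha: float) -> List[float]:
    """Return alpha for flagged heads and 1.0 for every other head."""
    flagged = set(flagged_heads)
    return [alpha if h in flagged else 1.0 for h in range(num_heads)]


# Toy data (hypothetical): three heads, each already projected to a
# 2-d contribution. Head 2 stands in for a head flagged as text-focused.
heads = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]

baseline = combine_heads(heads, scales_for(3, [], 1.0))
ablated = combine_heads(heads, scales_for(3, [2], 0.0))

print(baseline)  # [5.0, 6.0]
print(ablated)   # [1.0, 2.0]
```

Setting `alpha` to 0.0 removes a flagged head entirely, while a value between 0 and 1 only dampens it. Which heads to flag is the hard part, and this sketch does not address it.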

RDR 78 · Research-only · arxiv-ai
Alternative

NVIDIA NIM Model Catalog

Matched vision-language, vision language, multimodal; 3 source links; official inference catalog signal; access model: Free endpoint

RDR 83 · Free endpoint
Alternative

Hugging Face Inference Providers

Matched vision-language, vision language, multimodal; 2 source links; official inference catalog signal; access model: Paid API

RDR 80 · Paid API
| # | Alternative | Kind | Access | Fit | Why it appears | Source |
|---|---|---|---|---|---|---|
| 01 | NVIDIA NIM Model Catalog | service | Free endpoint | RDR 83 | Matched vision-language, vision language, multimodal; 3 source links; official inference catalog signal; access model: Free endpoint | build.nvidia.com |
| 02 | Hugging Face Inference Providers | service | Paid API | RDR 80 | Matched vision-language, vision language, multimodal; 2 source links; official inference catalog signal; access model: Paid API | huggingface.co |
| 03 | Fireworks AI Serverless Models | service | Paid API | RDR 79 | Matched vision-language, vision language, multimodal; 2 source links; official inference catalog signal; access model: Paid API | docs.fireworks.ai |
| 04 | Together AI Serverless Models | service | Paid API | RDR 79 | Matched vision-language, vision language, multimodal; 2 source links; official inference catalog signal; access model: Paid API | docs.together.ai |
| 05 | amalia-llm/MATH-Vision-PT | model | Open weights | RDR 78 | Matched vision-language, vision language, image-to-text; 2 source links; access model: Open weights; freshly updated | huggingface.co |
| 06 | RSICCLLM: A Multimodal Large Language Model for Remote Sensing Image Change Captioning | paper | Research-only | RDR 75 | Matched vision-language, vision language, multimodal; 2 source links; access model: Research-only; freshly updated | arxiv.org |
| 07 | Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models | paper | Research-only | RDR 74 | Matched vision-language, vision language, multimodal; 1 source link; access model: Research-only | arxiv.org |