Automated alternatives

Best alternatives to Towards Robustness against Typographic Attack with Training-free Concept Localization.

Live, source-backed alternatives to Towards Robustness against Typographic Attack with Training-free Concept Localization in the vision-language category. Alternatives are selected from the same task category and are refreshed whenever the best-of index rebuilds.

Updated Jul 2, 2026 (from radar data)

Reference option

Towards Robustness against Typographic Attack with Training-free Concept Localization

Models trained via Contrastive Language-Image Pretraining (CLIP) serve as the foundational vision encoders for most modern Large Vision Language Models (LVLMs). Despite their widespread adoption, CLIP models exhibit a critical yet underexplored failure mode: irrelevant text appearing within images confounds visual representations, biasing them toward lexical meaning rather than true visual semantics. This robustness issue, commonly described as a Typographic Attack (TA), exposes a vulnerability that poses a significant risk to safety-critical applications such as autonomous driving. To achieve interpretable and effective robustness against TA, we propose a novel, training-free mechanistic interpretability method. Our method provides sampling-based interpretations of hidden state representations and quantitatively attributes semantic versus lexical focus to individual attention heads. Through probabilistic analysis and circuit mining, we isolate specific Vision Transformer (ViT) components that disproportionately encode lexical information, thereby identifying the mechanistic source of TA. We further show that simple interventions applied directly to the identified circuits, without any additional training, can substantially improve robustness against Typographic Attacks in object classification. These interventions, such as selective adjustment of attention weights, also outperform both supervised and training-free defense methods. Our experiments demonstrate that applying the proposed intervention to the vision encoders of several state-of-the-art LVLMs yields substantial gains in Visual Question Answering accuracy under Typographic Attack interference on RIO-Bench. These results confirm both the efficacy and the generalizability of our mechanistic approach. Code is released at https://github.com/Liu-524/SamplingTAR. 
cs.CV cs.CL. Research signal collected from arXiv metadata.
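The abstract describes attributing lexical versus semantic focus to individual attention heads and then intervening on the identified heads without retraining. The toy sketch below illustrates only the general idea of a head-level intervention: flag heads by a score, then down-weight their contribution before the per-head outputs are summed. It is not the authors' implementation. The function names, the scores, the threshold, and the scaling factor are all assumptions made for illustration; the released code at the URL above is the authoritative reference.

```python
from typing import List, Sequence, Set


def select_lexical_heads(lexical_scores: Sequence[float], threshold: float) -> Set[int]:
    """Return indices of heads whose score exceeds the threshold.

    The scores are hypothetical stand-ins for a per-head "lexical focus"
    measure; how such scores are actually computed is not shown here.
    """
    return {i for i, score in enumerate(lexical_scores) if score > threshold}


def combine_heads(
    head_outputs: Sequence[Sequence[float]],
    lexical_heads: Set[int],
    scale: float = 0.0,
) -> List[float]:
    """Sum per-head contributions, multiplying flagged heads by `scale`.

    scale=0.0 removes a flagged head entirely; scale=1.0 leaves it unchanged.
    """
    dim = len(head_outputs[0])
    combined = [0.0] * dim
    for idx, output in enumerate(head_outputs):
        weight = scale if idx in lexical_heads else 1.0
        for j, value in enumerate(output):
            combined[j] += weight * value
    return combined


if __name__ == "__main__":
    # Three toy heads, each contributing a 2-dimensional vector (made-up numbers).
    head_outputs = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
    # Hypothetical lexical-focus scores; head 1 is flagged at threshold 0.5.
    scores = [0.1, 0.9, 0.2]
    flagged = select_lexical_heads(scores, threshold=0.5)
    print(combine_heads(head_outputs, set()))    # baseline: [4.0, 5.0]
    print(combine_heads(head_outputs, flagged))  # head 1 removed: [4.0, 3.0]
```

In a real vision encoder the same scaling would be applied to the flagged heads' outputs inside the attention layers, which requires access to per-head activations; that wiring is model-specific and is left out of this sketch.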

RDR78 Research-only arxiv-ai

Alternative

NVIDIA NIM Model Catalog

Matched vision-language, vision language, multimodal; 3 source links; official inference catalog signal; access model: Free endpoint

RDR83 Free endpoint

Alternative

Hugging Face Inference Providers

Matched vision-language, vision language, multimodal; 2 source links; official inference catalog signal; access model: Paid API

RDR80 Paid API

#	Alternative	Kind	Access	Fit	Why it appears	Source
01	NVIDIA NIM Model Catalog	service	Free endpoint	RDR83	Matched vision-language, vision language, multimodal; 3 source links; official inference catalog signal; access model: Free endpoint	build.nvidia.com
02	Hugging Face Inference Providers	service	Paid API	RDR80	Matched vision-language, vision language, multimodal; 2 source links; official inference catalog signal; access model: Paid API	huggingface.co
03	Fireworks AI Serverless Models	service	Paid API	RDR79	Matched vision-language, vision language, multimodal; 2 source links; official inference catalog signal; access model: Paid API	docs.fireworks.ai
04	Together AI Serverless Models	service	Paid API	RDR79	Matched vision-language, vision language, multimodal; 2 source links; official inference catalog signal; access model: Paid API	docs.together.ai
05	amalia-llm/MATH-Vision-PT	model	Open weights	RDR78	Matched vision-language, vision language, image-to-text; 2 source links; access model: Open weights; freshly updated	huggingface.co
06	RSICCLLM: A Multimodal Large Language Model for Remote Sensing Image Change Captioning	paper	Research-only	RDR75	Matched vision-language, vision language, multimodal; 2 source links; access model: Research-only; freshly updated	arxiv.org
07	Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models	paper	Research-only	RDR74	Matched vision-language, vision language, multimodal; 1 source link; access model: Research-only	arxiv.org
