VisAnalog: A Diagnostic Suite for Visual Concept Transfer on Natural Images
Abstract
A useful test of visual concept learning is not just whether a model can recognize a concept in a single image, but whether it can preserve and manipulate concept-level properties under transformation and transfer them to new scenes. We introduce VisAnalog, a controlled suite for this setting on natural images. Each example instantiates A:B::C:?, where image B and a hidden target image D are produced by applying the same deterministic transformation sequence to source images A and C, respectively. Given A, B, and C, a model must answer a multiple-choice question about D. The benchmark contains 617 human-validated questions spanning one- to four-step transformations such as zoom, quadrant swap, rotation, flip, and hue rotation. Across strong proprietary and open-source VLMs, end-to-end accuracy is substantially lower than oracle accuracy when D is directly shown, and it degrades sharply as transformation depth increases, while human performance remains near ceiling. A program-conditioned evaluation further separates failures of relation inference from failures of transformation application, showing that inferring the visual relation from A to B is the dominant bottleneck, with additional application errors emerging on harder multi-step cases.
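The A:B::C:? construction described in the abstract can be illustrated with a minimal sketch. This is not the authors' released code: the function names, the `TRANSFORMS` table, and the exact definitions of each operation (for example, which quadrants are swapped and how zoom is resampled) are assumptions made for illustration, and hue rotation is omitted for brevity. The sketch assumes square images with even side length stored as NumPy arrays of shape (H, W, C).

```python
import numpy as np

def rotate90(img):
    # Rotate 90 degrees counterclockwise in the image plane.
    return np.rot90(img, k=1, axes=(0, 1))

def flip_horizontal(img):
    # Mirror the image left to right.
    return img[:, ::-1]

def quadrant_swap(img):
    # Swap diagonally opposite quadrants (an assumed definition).
    h, w = img.shape[:2]
    return np.roll(img, shift=(h // 2, w // 2), axis=(0, 1))

def zoom2x(img):
    # Center-crop half the height and width, then upsample 2x
    # by nearest-neighbor repetition (an assumed definition).
    h, w = img.shape[:2]
    top, left = h // 4, w // 4
    crop = img[top:top + h // 2, left:left + w // 2]
    return crop.repeat(2, axis=0).repeat(2, axis=1)

# Hypothetical registry of named one-step transformations.
TRANSFORMS = {
    "rotate90": rotate90,
    "flip_horizontal": flip_horizontal,
    "quadrant_swap": quadrant_swap,
    "zoom2x": zoom2x,
}

def apply_program(img, program):
    # Apply a sequence of named transformations in order.
    for name in program:
        img = TRANSFORMS[name](img)
    return img

def make_analogy(a, c, program):
    # B and the hidden target D come from the same program,
    # applied to source images A and C respectively.
    return apply_program(a, program), apply_program(c, program)

# Usage with random stand-in images and a two-step program.
rng = np.random.default_rng(0)
image_a = rng.integers(0, 256, size=(8, 8, 3), dtype=np.uint8)
image_c = rng.integers(0, 256, size=(8, 8, 3), dtype=np.uint8)
image_b, image_d = make_analogy(image_a, image_c, ["rotate90", "quadrant_swap"])
```

In this framing, a model sees `image_a`, `image_b`, and `image_c` and is asked about `image_d`, while the length of the program corresponds to the transformation depth (one to four steps in the benchmark).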
Citation
@article{li2026visanalog,
  title={VisAnalog: A Diagnostic Suite for Visual Concept Transfer on Natural Images},
  author={Li, Zhaonan and Chickering, Kyle R and Li, Bangzheng and Dineen, Jacob and Ye, Xiao and Xu, Zhikun and Lu, Shijie and Huang, Yuxi and Shen, Ming and Nguyen, Bach and others},
  journal={arXiv preprint arXiv:2605.23141},
  year={2026}
}