Abstract:
Quantifying the associations between images and adjectives, i.e., how strongly the visual characteristics of an image are connected with a certain adjective, is important for better image understanding. For instance, the appearance of a kitten can be associated with adjectives such as ``soft'', ``small'', and ``cute'' rather than their opposites ``hard'', ``large'', and ``scary''. Thus, scoring a kitten photo by the degree of its association with each antonym adjective pair (termed an adjective axis, e.g., ``round'' vs. ``sharp'') aids in understanding the image content and its atmosphere. Existing methods rely on subjective human engagement, making it difficult to estimate the associations of images with arbitrary adjective axes in a single framework. To enable the extension to arbitrary axes, we explore the use of large-scale pretrained models, including Large Language Models (LLMs) and Vision Language Models (VLMs). In the proposed training-free framework, users only need to specify a pair of antonym nouns that describe the negative and positive ends of the target axis (e.g., ``roundness'' and ``sharpness''). Evaluation confirms that the proposed framework predicts negative and positive associations between adjectives and images as correctly as a manually assisted comparative method, and also highlights the pros and cons of utilizing the VLM's textual or visual embeddings for specific types of adjective axes. Furthermore, computing the similarities among four adjective axes unveils how the proposed framework connects them with each other, such as its tendency to regard a sharp object as being small, hard, and quick in motion.
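To make the idea of scoring an image along an antonym adjective axis concrete, the sketch below uses CLIP embeddings to compare an image against two noun prompts. This is only an illustrative assumption: the paper's actual pipeline (including its LLM component and its use of textual vs. visual embeddings) is not reproduced here, and the model name, prompt templates, and scoring formula are hypothetical choices.

```python
# Minimal sketch: score an image along an antonym adjective axis with CLIP.
# NOTE: illustrative only; not the paper's actual method. The prompts and
# the softmax-based score below are assumptions made for this example.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def axis_score(image: Image.Image, neg_noun: str, pos_noun: str) -> float:
    """Score `image` on the axis `neg_noun` vs. `pos_noun`, in [-1, 1]."""
    prompts = [f"a photo of {neg_noun}", f"a photo of {pos_noun}"]
    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Softmax over the two antonym prompts, then map to [-1, 1]:
    # -1 leans toward the negative noun, +1 toward the positive one.
    probs = out.logits_per_image.softmax(dim=-1)[0]
    return (probs[1] - probs[0]).item()

# Example: a kitten photo would be expected to lean toward "roundness".
score = axis_score(Image.open("kitten.jpg"), "sharpness", "roundness")
print(f"roundness score: {score:+.3f}")
```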
Type: 31st Intl. Conf. on MultiMedia Modeling (MMM2025)
Publication date: To be published in Jan 2025