TaiwanVQA: A Visual Question Answering Benchmark for Taiwanese Daily Life

Hsin-Yi Hsieh
Shang-Wei Liu
Chang-Chih Meng
Chien-Hua Chen
Shuo-Yueh Lin §
Hung-Ju Lin *
Hen-Hsen Huang
I-Chen Wu

EvalMG25 @ COLING 2025

National Center for High-performance Computing, Taiwan; §National Central University, Taiwan; National Yang Ming Chiao Tung University, Taiwan; *National Taiwan University, Taiwan; Institute of Information Science, Academia Sinica, Taiwan

An illustration of the TaiwanVQA benchmark: each image in the dataset is paired with two questions (one recognition and one reasoning) covering various aspects of Taiwanese daily life and culture.

Abstract

We introduce TaiwanVQA, a novel visual question answering benchmark designed to evaluate the ability of vision-language models (VLMs) to recognize and reason about Taiwan-specific multimodal content. TaiwanVQA comprises 2,000 image-question pairs covering diverse topics relevant to Taiwanese culture and daily life. We categorize the questions into recognition and reasoning tasks, further sub-classifying reasoning questions based on the level of external knowledge required. We conduct extensive experiments on state-of-the-art VLMs, including GPT-4o, Llama-3.2, LLaVA, Qwen2-VL, and InternVL2. Our findings reveal significant limitations in current VLMs when handling culturally specific content. There is a substantial performance gap between recognition tasks (top score 73.60%) and reasoning tasks (top score 49.80%), indicating challenges in cultural inference and contextual understanding. These results highlight the need for more culturally diverse training data and improved model architectures that can better integrate visual and textual information within specific cultural contexts. By providing TaiwanVQA, we aim to contribute to the development of more inclusive and culturally aware AI models, facilitating their deployment in diverse real-world settings. TaiwanVQA can be accessed on our GitHub page.

Introduction

Multimodal vision-language models (VLMs) have achieved remarkable success in integrating visual and textual information. However, most benchmarks focus on general-domain knowledge and widely used languages, overlooking challenges posed by culturally specific content and languages like Traditional Chinese. Understanding and reasoning about nuanced local content is crucial for deploying AI systems worldwide.

To address this gap, we introduce TaiwanVQA, a benchmark evaluating VLMs’ abilities to handle Taiwan-specific content. TaiwanVQA includes questions about local cuisine, festivals, landmarks, and public signage—culturally and contextually rich elements of Taiwanese daily life.

Data and Task Overview

TaiwanVQA consists of 1,000 images and 2,000 questions, evenly split into recognition and reasoning tasks. The reasoning tasks are further classified based on the complexity and external knowledge required, ensuring a multi-dimensional evaluation of model capabilities.
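To make the recognition/reasoning split concrete, the sketch below shows one way per-task accuracy could be computed over such a benchmark. The annotation layout, file name (taiwanvqa.json), and field names (question_id, task, answer) are assumptions for illustration only; the actual release format is documented on our GitHub page.

```python
import json
from collections import defaultdict

def evaluate(predictions, annotation_path="taiwanvqa.json"):
    """Compute accuracy per task type ("recognition" / "reasoning").

    `predictions` maps a question id to the model's chosen answer.
    Assumed (hypothetical) annotation layout: a list of records, each with
    "question_id", "task", and "answer" fields.
    """
    with open(annotation_path, encoding="utf-8") as f:
        records = json.load(f)

    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        task = rec["task"]                     # "recognition" or "reasoning"
        total[task] += 1
        if predictions.get(rec["question_id"]) == rec["answer"]:
            correct[task] += 1

    return {task: correct[task] / total[task] for task in total}
```

Reporting the two accuracies separately, rather than a single overall score, is what exposes the recognition-versus-reasoning gap discussed in the abstract.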

Distribution of Question Categories. The blue and red bars represent recognition and reasoning questions respectively. Darker bars show the total number of questions in each main topic, while lighter bars indicate the number of questions in subtopics. Topics without subtopics, like Daily Necessities, have no lighter bars.

Conclusion

TaiwanVQA highlights the limitations of current VLMs when tackling culturally specific content, particularly in reasoning tasks that demand cultural inference. The benchmark can guide the development of more culturally aware and inclusive AI systems, ultimately improving their applicability in diverse real-world scenarios.

BibTeX

@inproceedings{taiwanvqa,
  title     = {TaiwanVQA: A Visual Question Answering Benchmark for Taiwanese Daily Life},
  author    = {Hsin-Yi Hsieh and Shang-Wei Liu and Chang-Chih Meng and Chien-Hua Chen and Shuo-Yueh Lin and Hung-Ju Lin and Hen-Hsen Huang and I-Chen Wu},
  booktitle = {EvalMG25 @ COLING 2025},
  year      = {2025}
}