Abstract: Knowledge-based Visual Question Answering (VQA) is a challenging task that requires models to access external knowledge for reasoning. Large Language Models (LLMs) have recently been ...
Abstract: Open-world interpretation aims to accurately localize and recognize all objects within images by vision-language models (VLMs). While substantial progress has been made in this task for ...