While recent feed-forward 3D reconstruction models accelerate reconstruction by jointly inferring dense geometry and camera poses in a single pass, their reliance on dense attention imposes a ...
Abstract: The field of Large Visual-Language Models (LVLMs) has made significant strides in integrating visual recognition and language understanding. However, their application in multimodal ...
Abstract: Document Information Extraction aims to extract entities and relationships from visually rich documents. Traditional methods require significant annotation and lack generality. In this paper ...