While recent feed-forward 3D reconstruction models accelerate 3D reconstruction by jointly inferring dense geometry and camera poses in a single pass, their reliance on dense attention imposes a ...
Abstract: The field of Large Visual-Language Models (LVLMs) has made significant strides in integrating visual recognition and language understanding. However, its application in multimodal ...
Abstract: Document Information Extraction aims to extract entities and relationships from visually rich documents. Traditional methods require significant annotation and lack generality. In this paper ...