Abstract: Despite significant progress in Vision-Language Pre-training (VLP), current approaches predominantly emphasize feature extraction and cross-modal comprehension, with limited attention to ...
We find a commonality of various dirty samples is visual-linguistic inconsistency between images and associated labels. To capture the semantic inconsistency between modalities, we propose versatile ...
remove-circle Internet Archive's in-browser bookreader "theater" requires JavaScript to be enabled. It appears your browser does not have it turned on. Please see ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results