VirTex is a pretraining approach which uses semantically dense captions to learn visual representations. We train CNN + Transformers from scratch on COCO Captions, and transfer the CNN to downstream ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results