Abstract: Texts in scene images convey critical information for scene understanding and reasoning. The abilities of reading and reasoning matter for the model in the text-based visual question ...