Abstract: This paper proposes a multi-modal object detection and recognition method based on model-level fusion. The method processes inputs from several modalities, such as images, speech, and video, and fuses ...