MSVMamba is a visual state space model that introduces a hierarchy-in-hierarchy design to the VMamba model. This repository contains the code for training and evaluating MSVMamba models on the ...
Abstract: Landslides are among the most destructive natural disasters in the world, threatening human life and safety. Owing to its excellent performance as a foundation model for image segmentation, the ...
Leveraging the extensive training data from SA-1B, the Segment Anything Model (SAM) demonstrates remarkable generalization and zero-shot capabilities. However, as a category-agnostic instance ...
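As a point of reference for SAM's prompt-driven, category-agnostic behavior, here is a minimal sketch using the official segment-anything package; the checkpoint path, input image, and click coordinate are placeholders, not values from the work above.

```python
# Minimal sketch of prompt-based inference with the segment-anything package.
# The checkpoint path, image file, and point prompt are assumptions for illustration.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load a ViT-H SAM checkpoint (adjust the path to your local setup).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("scene.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single foreground point prompt; SAM returns candidate masks and quality
# scores, but no semantic class for any of them (category-agnostic output).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
print(masks.shape, scores)  # e.g. (3, H, W) boolean masks with confidence scores
```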
CLIP is one of the most important multimodal foundation models today, aligning visual and textual signals into a shared feature space using a simple contrastive learning loss on large-scale ...
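To make the alignment objective concrete, below is a small PyTorch sketch of the symmetric contrastive (InfoNCE) loss described in the CLIP paper's pseudocode; the feature tensors here are random stand-ins for encoder outputs, not CLIP's actual encoders.

```python
# Sketch of CLIP-style symmetric contrastive loss over image-text pairs.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    # Project both modalities onto the unit sphere of the shared space.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Scaled cosine-similarity logits for every image-text pair in the batch.
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # Matched pairs sit on the diagonal; cross-entropy is applied in both directions.
    targets = torch.arange(image_features.size(0), device=image_features.device)
    loss_i = F.cross_entropy(logits_per_image, targets)
    loss_t = F.cross_entropy(logits_per_text, targets)
    return (loss_i + loss_t) / 2

# Toy usage with random features standing in for encoder outputs.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt, logit_scale=torch.tensor(100.0)))
```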
We introduce Visual Reinforcement Fine-tuning (Visual-RFT), the first comprehensive adaptation of DeepSeek-R1's RL strategy to the multimodal field. We use the Qwen2-VL-2/7B model as our base model ...
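The R1-style recipe relies on verifiable, rule-based rewards rather than a learned reward model. The sketch below illustrates one such reward for a detection-style visual task, scoring a rollout by the IoU between a box parsed from the model's answer and the ground truth; the answer format and parsing are assumptions for illustration, not Visual-RFT's exact reward definition.

```python
# Illustrative verifiable reward for a grounding/detection rollout:
# IoU between a predicted box parsed from the model's text answer and the
# ground-truth box. Answer format and regex are hypothetical placeholders.
import re

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2); returns intersection-over-union in [0, 1].
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def detection_reward(model_answer: str, gt_box) -> float:
    # Parse the first "[x1, y1, x2, y2]" box from the answer text;
    # an unparseable answer earns zero reward (the format acts as a hard check).
    match = re.search(
        r"\[\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\]",
        model_answer,
    )
    if match is None:
        return 0.0
    pred_box = tuple(float(v) for v in match.groups())
    return iou(pred_box, gt_box)

# Toy usage: a rollout that localizes the object well earns a reward near 1.
print(detection_reward("<answer>[10, 20, 110, 220]</answer>", (12, 18, 108, 225)))
```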