🎯 SAM 2++: Tracking Anything at Any Granularity


🌟 Overview

A unified video tracking foundation model that can track objects at any granularity, including masks, bounding boxes, and points.

Video tracking aims to locate a specific target in subsequent frames given its initial state. Because the granularity of the target state varies across tasks, most existing trackers are tailored to a single task and rely heavily on custom-designed, task-specific modules, which limits their generalization and leads to redundancy in both model design and parameters. To unify video tracking tasks, we present SAM 2++, a single model for tracking at any granularity, including masks, boxes, and points. First, to extend target granularity, we design task-specific prompts that encode the various task inputs into general prompt embeddings, and a unified decoder that converts the results of the different tasks into a common form before output. Next, to support memory matching, the core operation of tracking, we introduce a task-adaptive memory mechanism that unifies memory across different granularities. Finally, we introduce a customized data engine to support training for tracking at any granularity. It produces Tracking-Any-Granularity, a large and diverse video tracking dataset with rich annotations at three granularities, which serves as a comprehensive resource for training and benchmarking unified tracking. Comprehensive experiments on multiple benchmarks confirm that SAM 2++ sets a new state of the art across diverse tracking tasks at different granularities, establishing a unified and robust tracking framework.
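To make the three granularities concrete, here is a minimal, self-contained sketch in plain Python. It only illustrates how one target state can be described as a mask, a box, or a point; it is not the SAM 2++ API or its prompt encoder. The names `mask_to_box`, `mask_to_point`, and `example_mask` are hypothetical and chosen for illustration.

```python
from typing import List, Tuple

# Hypothetical type aliases for illustration only (not part of SAM 2++).
Mask = List[List[int]]            # binary mask as rows of 0/1 values
Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max), inclusive pixel indices
Point = Tuple[float, float]       # (x, y)


def _foreground(mask: Mask) -> List[Tuple[int, int]]:
    """Collect the (x, y) coordinates of all non-zero pixels."""
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        raise ValueError("mask has no foreground pixels")
    return coords


def mask_to_box(mask: Mask) -> Box:
    """Coarsen a mask to its tight axis-aligned bounding box."""
    coords = _foreground(mask)
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


def mask_to_point(mask: Mask) -> Point:
    """Coarsen a mask to a single point (the centroid of its foreground pixels)."""
    coords = _foreground(mask)
    n = len(coords)
    return (sum(x for x, _ in coords) / n, sum(y for _, y in coords) / n)


# A 4x4 frame with a 2x2 target in the middle.
example_mask: Mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

print(mask_to_box(example_mask))    # (1, 1, 2, 2)
print(mask_to_point(example_mask))  # (1.5, 1.5)
```

The centroid is used here only because it is simple; for a non-convex target it can fall outside the mask, and point tracking benchmarks typically annotate specific surface points rather than centroids.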



overview figure

 


🎥 Demos produced by SAM 2++

Video Object Segmentation: mask ❤️ 💛 💙

DAVIS 2017, LVOS v2, MOSE
Tracking-Any-Granularity, VISOR



Single Object Tracking: box 🟥 🟨 🟦

GOT-10k, LaSOT, NFS
Tracking-Any-Granularity, UAV123



Point Tracking: point 🔴 🟠 🔵

TAP-Vid RGB-Stacking, TAP-Vid DAVIS, TAP-Vid RoboTAP
Tracking-Any-Granularity, PerceptionTest

 


🗃️ Example videos from our Tracking-Any-Granularity dataset




Analysis of Statistics and Attributes

attr figure


Comparison of our dataset with public datasets

data_compare figure

 


🏆 Comparisons with existing video tracking methods

Video Object Segmentation

comparisons/comp1 figure

Single Object Tracking

comparisons/comp2 figure

Point Tracking

comparisons/comp3 figure