Vinay-Umrethe/SigMamba-V1

GitHub: Vinay-Umrethe/SigMamba-V1

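This project combines the SigLIP 2 vision encoder with the Mamba SSM to detect anomalous events in long videos with high accuracy and efficiency, at linear computational complexity.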

Stars: 1 | Forks: 0

# SigMamba-V1: Unified Video Anomaly Detection using SigLIP 2 and Mamba SSM

A unified architecture that combines **SigLIP 2** (Google's SOTA vision encoder) with **Mamba** (a linear-complexity state space model) to detect anomalies in surveillance video. The system scales linearly, $O(N)$, in sequence length, which makes it feasible to process long-form video that was previously impractical with Transformers because of their quadratic cost.

## Model Comparison

### Benchmarks

| Metric | V1 (Large) | V1 (Small) |
| :--- | :--- | :--- |
| **AUC-ROC** | **89.82%** | 87.57% |
| **Average Precision** | **41.05%** | 32.04% |
| **Best F1-Score** | 41.18% | **41.90%** |
| **Inference FPS** | 1,022 | **3,242** |
| **Peak VRAM** | 5,148 MB | **3,207 MB** |

## Key Features

- **Linear complexity**: $O(N)$ scaling via the Mamba SSM (versus $O(N^2)$ for Transformers)
- **Dual input modes**: accepts raw pixels or pre-extracted features

## Architecture

The model operates in two modes:

| Mode | Input | Use Case |
|------|-------|----------|
| **Unified** | `pixel_values` (B, T, 3, 384, 384) | End-to-end inference |
| **Modular** | `features` (B, T, 1024) | Training / batch processing |

**The architecture has linear computational complexity $O(N)$ in sequence length, so it can process long surveillance videos in real time while maintaining high detection accuracy.**

```
graph LR
    subgraph "Preprocessing Stage"
        A[Raw Video] --> B[SigLIP2 Vision Encoder]
        B --> C[Extracted Features]
    end
    subgraph "Training Stage"
        C --> D[Dataset Loader]
        D --> E[Mamba Encoder]
        E --> F[MIL Head]
        F --> G[RTFM Loss]
    end
    subgraph "Inference Stage"
        C --> H[Trained Model]
        H --> I[Anomaly Scores 0-100%]
    end
```

## Hyperparameters

### Large

| Parameter | Value | Description |
|-----------|-------|-------------|
| Feature Dim | 1024 | SigLIP output dimension |
| Mamba d_model | 768 | Internal hidden dimension |
| Mamba Depth | 8 | Number of stacked layers |

### Small

| Parameter | Value | Description |
|-----------|-------|-------------|
| Feature Dim | 768 | SigLIP output dimension |
| Mamba d_model | 768 | Internal hidden dimension |
| Mamba Depth | 8 | Number of stacked layers |

**SigMamba:** the Mamba model has roughly 32M parameters.

## Usage

### Prerequisites

```
pip install opencv-python
pip install transformers==4.57.3
```

### Loading the Model

```
from transformers import AutoModel, AutoProcessor
import torch

model = AutoModel.from_pretrained(
    "VINAY-UMRETHE/SigMamba-V1-Large",
    trust_remote_code=True
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

# The image processor comes from the vision backbone named in the model config.
processor = AutoProcessor.from_pretrained(model.config.vision_model_id)
```

## Inference Mode 1: Unified (Raw Pixels → Scores)

Use this mode when you have raw video files. The model handles feature extraction internally.

### Example: Single Video

```
import cv2
import numpy as np

def load_video_frames(video_path, num_frames=32):
    """Sample frames uniformly from a video and return them as RGB arrays."""
    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ret, frame = cap.read()
        if ret:
            # OpenCV decodes to BGR; the processor expects RGB.
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(frame)
    cap.release()
    return frames

frames = load_video_frames("test_video.mp4", num_frames=32)
inputs = processor(images=frames, return_tensors="pt")
pixel_values = inputs.pixel_values.to(device)
pixel_values = pixel_values.unsqueeze(0)  # Add batch dim: (1, T, 3, H, W)

# Run inference.
with torch.no_grad():
    scores = model(pixel_values=pixel_values)

# Get results.
anomaly_scores = scores.squeeze().cpu().numpy()
max_score = anomaly_scores.max()
print(f"Max Anomaly Score: {max_score:.4f}")
```

## Inference Mode 2: Modular (Pre-extracted Features → Scores)

Use this mode when you have already extracted features, for example for training the model.

### Example: Loading from a Feature File

```
def load_features_from_txt(feature_path):
    """Load features from a text file (one line per segment)."""
    with open(feature_path, 'r') as f:
        lines = f.readlines()
    features = []
    for line in lines:
        if not line.strip():
            continue  # Skip blank lines so every row has the same length.
        values = [float(v) for v in line.strip().split()]
        features.append(values)
    return torch.tensor(features, dtype=torch.float32)

# Load features.
features = load_features_from_txt("video_features.txt")
features = features.unsqueeze(0).to(device)  # Add batch dim: (1, T, D)

# Run inference.
with torch.no_grad():
    scores = model(features=features)
print(f"Anomaly Scores: {scores.squeeze().cpu().numpy()}")
```

## Batch Processing Multiple Videos

Process multiple videos in a single forward pass for efficiency.

```
# Load multiple videos.
video_paths = ["video1.mp4", "video2.mp4", "video3.mp4"]
batch_frames = []
for path in video_paths:
    frames = load_video_frames(path, num_frames=32)
    inputs = processor(images=frames, return_tensors="pt")
    batch_frames.append(inputs.pixel_values)

# torch.stack requires every video to contribute the same number of frames.
pixel_values = torch.stack(batch_frames).to(device)

with torch.no_grad():
    scores = model(pixel_values=pixel_values)

for i, path in enumerate(video_paths):
    max_score = scores[i].max().item()
    print(f"{path}: {max_score:.4f}")
```

## Single Frame

For scoring a single frame.

```
from PIL import Image

# Load a single image.
image = Image.open("suspicious_frame.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
pixel_values = inputs.pixel_values.to(device)
pixel_values = pixel_values.unsqueeze(0)  # (1, 1, 3, H, W): batch of one, sequence of one

with torch.no_grad():
    score = model(pixel_values=pixel_values)
print(f"Frame Anomaly Score: {score.item():.4f}")
```

## Feature Extraction Only (No Classification)

Access the Mamba encoder output directly for custom downstream tasks.

```
# Load frames.
frames = load_video_frames("video.mp4", num_frames=32)
inputs = processor(images=frames, return_tensors="pt")
pixel_values = inputs.pixel_values.unsqueeze(0).to(device)

# Access internal components.
with torch.no_grad():
    # Step 1: Extract vision features.
    b, t, c, h, w = pixel_values.shape
    flat_pixels = pixel_values.view(b * t, c, h, w)
    vision_features = model.vision_model.get_image_features(pixel_values=flat_pixels)
    # L2-normalize each frame embedding, then restore the (B, T, D) layout.
    vision_features = vision_features / vision_features.norm(dim=-1, keepdim=True)
    vision_features = vision_features.view(b, t, -1)

    # Step 2: Get Mamba-encoded features.
    mamba_features = model.mamba_encoder(vision_features)

print(f"Vision Features: {vision_features.shape}")
print(f"Mamba Features: {mamba_features.shape}")
```

## Threshold-Based Detection

Apply a threshold to convert scores into binary predictions.

```
def detect_anomalies(video_path, threshold=0.5):
    """Return per-segment scores and the indices of anomalous segments."""
    frames = load_video_frames(video_path, num_frames=32)
    inputs = processor(images=frames, return_tensors="pt")
    pixel_values = inputs.pixel_values.unsqueeze(0).to(device)

    with torch.no_grad():
        scores = model(pixel_values=pixel_values)
    scores = scores.squeeze().cpu().numpy()

    anomalous_segments = np.where(scores > threshold)[0]
    return {
        "scores": scores,
        "max_score": float(scores.max()),
        "is_anomalous": bool(scores.max() > threshold),
        "anomalous_segments": anomalous_segments.tolist()
    }

# Run inference.
result = detect_anomalies("test.mp4", threshold=0.5)
print(f"Anomalous: {result['is_anomalous']}")
print(f"Segments: {result['anomalous_segments']}")
```

## Output Reference

| Method | Input Shape | Output Shape | Description |
|:---|:---|:---|:---|
| `model(pixel_values=...)` | `(B, T, C, H, W)` | `(B, T, O)` | End-to-end inference on raw frames |
| `model(features=...)` | `(B, T, D)` | `(B, T, O)` | Training with pre-extracted features |
| `model.mamba_encoder(...)` | `(B, T, D)` | `(B, T, M)` | Processes features through Mamba |
| `model.vision_encoder(...)` | `(N, C, H, W)` | `(N, D)` | Extracts SigLIP features from frames |

Where:

* **B**: Batch size
* **T**: Video sequence length (total number of frames)
* **N**: Total number of frames after flattening (B x T)
* **C**: Number of image channels (3)
* **H, W**: Frame height and width (384)
* **O**: Output anomaly dimension (1)
* **M**: Mamba internal hidden dimension (768)
* **D**: Vision feature embedding dimension (1024 for Large, 768 for Small)

## License

SigMamba-V1: Unified Video Anomaly Detection using SigLIP 2 and Mamba SSM
Copyright © 2026 Vinay Umrethe.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see the GNU licenses website.

All model weights, generated outputs, and the modeling code (in the `sigmamba_release/` directory) are licensed under the **MIT License**.

## References

[1] Tschannen, Michael, et al. "SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features." arXiv preprint arXiv:2502.14786 (2025).

[2] Gu, Albert, and Tri Dao. "Mamba: Linear-time sequence modeling with selective state spaces." First Conference on Language Modeling. 2024.

[3] Zhang, Yiling, Erkut Akdag, and Egor Bondarev. "MTFL: multi-timescale feature learning for weakly-supervised anomaly detection in surveillance videos." Seventeenth International Conference on Machine Vision (ICMV 2024). Vol. 13517. SPIE, 2025.

## Citation

```
@misc{vinayumrethesigmamba2026,
  title        = {SigMamba: Unified Video Anomaly Detection using SigLIP 2 and Mamba SSM},
  author       = {Vinay Umrethe},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/collections/VINAY-UMRETHE/sigmamba-inventory}}
}
```

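## Mapping Segment Indices to Timestamps (Sketch)

`detect_anomalies` returns indices into the sampled frames, not positions in the source video. The sketch below maps those indices back to timestamps in seconds, assuming frames were sampled with `np.linspace(0, total_frames - 1, num_frames)` as in `load_video_frames` and that no frame was dropped by a failed read. The helper name `segment_timestamps` is hypothetical; `fps` and `total_frames` are assumed to come from `cv2.CAP_PROP_FPS` and `cv2.CAP_PROP_FRAME_COUNT`.

```python
import numpy as np

def segment_timestamps(segment_indices, total_frames, fps, num_frames=32):
    """Map sampled-segment indices to timestamps (seconds) in the source video.

    Hypothetical helper: it mirrors the uniform sampling used by
    load_video_frames and assumes every sampled frame was read successfully.
    """
    if fps <= 0 or total_frames <= 0:
        raise ValueError("fps and total_frames must be positive")
    # Recompute the frame numbers that uniform sampling would have selected.
    frame_ids = np.linspace(0, total_frames - 1, num_frames, dtype=int)
    return [float(frame_ids[i]) / fps for i in segment_indices]
```

For example, `segment_timestamps(result["anomalous_segments"], total_frames, fps)` gives one timestamp per flagged segment. If some frames fail to decode, the returned list of frames is shorter than `num_frames` and the indices no longer line up, so the mapping should be treated as approximate in that case.
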
Tags: Mamba, PyTorch, SigLIP, credential scanning, state space models, video anomaly detection, computer vision, reverse-engineering tools