# VSSM Encoder

Visual State Space Model encoder backbone for RS3Mamba.

## models.encoders.vssm_encoder

Visual State Space Model (VSSM) Encoder for RS3Mamba.

This module contains the core Mamba components for 2D vision tasks. The implementation follows VMamba/SwinUMamba but uses mamba-ssm primitives.

Original source: https://github.com/sstary/SSRS/tree/main/RS3Mamba
Paper: RS3Mamba: Visual State Space Model for Remote Sensing Image Semantic Segmentation
### `PatchEmbed2D(patch_size: int = 4, in_chans: int = 3, embed_dim: int = 96, norm_layer: type[nn.Module] | None = None, **kwargs)`

Bases: `Module`

Image to Patch Embedding.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `patch_size` | `int` | Patch token size. Default: 4. | `4` |
| `in_chans` | `int` | Number of input image channels. Default: 3. | `3` |
| `embed_dim` | `int` | Number of linear projection output channels. Default: 96. | `96` |
| `norm_layer` | `type[Module] \| None` | Normalization layer. Default: None. | `None` |

Source code in `src/models/encoders/vssm_encoder.py`
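The class itself lives in the source file above; the following is a minimal standalone sketch of what patch embedding does, assuming the usual VMamba formulation (a `Conv2d` with `kernel_size = stride = patch_size`, followed by the optional norm, producing channels-last tokens):

```python
import torch
import torch.nn as nn

# Sketch only: a strided conv carves the image into non-overlapping
# patch_size x patch_size patches and projects each to embed_dim channels.
patch_size, in_chans, embed_dim = 4, 3, 96
proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
norm = nn.LayerNorm(embed_dim)

x = torch.randn(2, in_chans, 256, 256)   # (B, C, H, W)
tokens = proj(x).permute(0, 2, 3, 1)     # (B, H/4, W/4, embed_dim), channels-last
tokens = norm(tokens)
print(tokens.shape)                      # torch.Size([2, 64, 64, 96])
```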
### `PatchMerging2D(dim: int, norm_layer: type[nn.Module] = nn.LayerNorm)`

Bases: `Module`

Patch Merging Layer for downsampling.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dim` | `int` | Number of input channels. | *required* |
| `norm_layer` | `type[Module]` | Normalization layer. Default: nn.LayerNorm. | `LayerNorm` |

Source code in `src/models/encoders/vssm_encoder.py`
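For intuition, here is a hedged sketch of Swin/VMamba-style 2×2 patch merging, the operation this layer performs: four spatial neighbours are concatenated along channels, normalised, and linearly reduced, halving resolution while doubling width. Channels-last input is an assumption carried over from VMamba:

```python
import torch
import torch.nn as nn

dim = 96
norm = nn.LayerNorm(4 * dim)
reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

x = torch.randn(2, 64, 64, dim)               # (B, H, W, C), channels-last
x0 = x[:, 0::2, 0::2, :]                      # top-left of each 2x2 block
x1 = x[:, 1::2, 0::2, :]                      # bottom-left
x2 = x[:, 0::2, 1::2, :]                      # top-right
x3 = x[:, 1::2, 1::2, :]                      # bottom-right
merged = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4*C)
out = reduction(norm(merged))                 # (B, H/2, W/2, 2*C)
print(out.shape)                              # torch.Size([2, 32, 32, 192])
```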
### `SS2D(d_model: int, d_state: int = 16, d_conv: int = 3, expand: int = 2, dt_rank: str | int = 'auto', dt_min: float = 0.001, dt_max: float = 0.1, dt_init: str = 'random', dt_scale: float = 1.0, dt_init_floor: float = 0.0001, dropout: float = 0.0, conv_bias: bool = True, bias: bool = False, device: torch.device | None = None, dtype: torch.dtype | None = None, **kwargs)`

Bases: `Module`

Selective Scan 2D: the core Mamba operation for 2D images.

Implements four-directional (bidirectional row- and column-wise) scanning to capture long-range dependencies in 2D feature maps.

Source code in `src/models/encoders/vssm_encoder.py`
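A sketch of the four scan orders such a cross-scan typically uses (row-major, column-major, and the reverses of both), under the assumption that SS2D follows the VMamba cross-scan scheme; each order linearises the 2D map into a 1D sequence for the selective-scan kernel:

```python
import torch

B, C, H, W = 1, 8, 4, 4
x = torch.randn(B, C, H, W)

# Direction 1: left-to-right, top-to-bottom (row-major flatten).
row_major = x.view(B, C, H * W)
# Direction 2: top-to-bottom, left-to-right (column-major flatten).
col_major = x.transpose(2, 3).contiguous().view(B, C, H * W)
scans = torch.stack([row_major, col_major], dim=1)        # (B, 2, C, L)
# Directions 3 and 4: the same two orders traversed in reverse.
scans = torch.cat([scans, scans.flip(dims=[-1])], dim=1)  # (B, 4, C, L)
print(scans.shape)                                        # torch.Size([1, 4, 8, 16])
```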
### `VSSBlock(hidden_dim: int = 0, drop_path: float = 0, norm_layer: Callable[..., nn.Module] = partial(nn.LayerNorm, eps=1e-06), attn_drop_rate: float = 0, d_state: int = 16, **kwargs)`

Bases: `Module`

Visual State Space Block.

Source code in `src/models/encoders/vssm_encoder.py`
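A hedged sketch of the residual layout a block like this usually applies, assuming the pre-norm form `x + DropPath(SS2D(LayerNorm(x)))` used by VMamba; `nn.Identity()` stands in for the SS2D mixer and stochastic depth so only the shape flow is shown:

```python
import torch
import torch.nn as nn

hidden_dim = 96
norm = nn.LayerNorm(hidden_dim)
ss2d = nn.Identity()       # placeholder for the SS2D module documented above
drop_path = nn.Identity()  # placeholder for timm-style stochastic depth

x = torch.randn(2, 64, 64, hidden_dim)  # (B, H, W, C), channels-last
out = x + drop_path(ss2d(norm(x)))      # pre-norm residual block
print(out.shape)                        # torch.Size([2, 64, 64, 96])
```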
### `VSSLayer(dim: int, depth: int, attn_drop: float = 0.0, drop_path: float | list[float] = 0.0, norm_layer: type[nn.Module] = nn.LayerNorm, downsample: type[nn.Module] | None = None, use_checkpoint: bool = False, d_state: int = 16, **kwargs)`

Bases: `Module`

A layer containing multiple VSSBlocks.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dim` | `int` | Number of input channels. | *required* |
| `depth` | `int` | Number of blocks. | *required* |
| `attn_drop` | `float` | Attention dropout rate. Default: 0.0. | `0.0` |
| `drop_path` | `float \| list[float]` | Stochastic depth rate. Default: 0.0. | `0.0` |
| `norm_layer` | `type[Module]` | Normalization layer. Default: nn.LayerNorm. | `LayerNorm` |
| `downsample` | `type[Module] \| None` | Downsample layer at the end. Default: None. | `None` |
| `use_checkpoint` | `bool` | Whether to use checkpointing. Default: False. | `False` |
| `d_state` | `int` | State dimension for Mamba. Default: 16. | `16` |

Source code in `src/models/encoders/vssm_encoder.py`
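A usage sketch based on the documented signature. Channels-last `(B, H, W, C)` input and the expected output shape are assumptions carried over from VMamba; the per-block `drop_path` list mirrors how stochastic depth is typically distributed across a stage:

```python
import torch

from models.encoders.vssm_encoder import PatchMerging2D, VSSLayer

# One stage: two VSSBlocks followed by a downsampling step.
layer = VSSLayer(
    dim=96,
    depth=2,
    drop_path=[0.0, 0.1],        # one stochastic-depth rate per block
    downsample=PatchMerging2D,   # halves resolution, doubles channels
    d_state=16,
)

x = torch.randn(2, 64, 64, 96)   # (B, H, W, C), channels-last (assumed)
out = layer(x)                   # expected (2, 32, 32, 192) after merging
```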
### `VSSMEncoder(patch_size: int = 4, in_chans: int = 3, depths: list[int] | None = None, dims: list[int] | None = None, d_state: int = 16, drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.2, norm_layer: type[nn.Module] = nn.LayerNorm, patch_norm: bool = True, use_checkpoint: bool = False, **kwargs)`

Bases: `Module`

Visual State Space Model Encoder.

Hierarchical encoder based on the VMamba architecture.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `patch_size` | `int` | Patch embedding size. Default: 4. | `4` |
| `in_chans` | `int` | Number of input channels. Default: 3. | `3` |
| `depths` | `list[int] \| None` | Depth of each stage. Default: [2, 2, 9, 2]. | `None` |
| `dims` | `list[int] \| None` | Dimensions at each stage. Default: [96, 192, 384, 768]. | `None` |
| `d_state` | `int` | State dimension for Mamba. Default: 16. | `16` |
| `drop_rate` | `float` | Dropout rate. Default: 0.0. | `0.0` |
| `attn_drop_rate` | `float` | Attention dropout rate. Default: 0.0. | `0.0` |
| `drop_path_rate` | `float` | Stochastic depth rate. Default: 0.2. | `0.2` |
| `norm_layer` | `type[Module]` | Normalization layer. Default: nn.LayerNorm. | `LayerNorm` |
| `patch_norm` | `bool` | Whether to apply norm after patch embedding. Default: True. | `True` |
| `use_checkpoint` | `bool` | Whether to use checkpointing. Default: False. | `False` |

Source code in `src/models/encoders/vssm_encoder.py`
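A usage sketch based on the documented signature, with `depths`/`dims` left at their documented defaults (`[2, 2, 9, 2]` / `[96, 192, 384, 768]`). That the encoder returns a list of multi-scale feature maps for a decoder is an assumption borrowed from VMamba/SwinUMamba-style backbones; check the source for the exact return contract:

```python
import torch

from models.encoders.vssm_encoder import VSSMEncoder

encoder = VSSMEncoder(patch_size=4, in_chans=3, drop_path_rate=0.2)

x = torch.randn(2, 3, 512, 512)  # (B, C, H, W) input image batch
features = encoder(x)            # assumed: one feature map per stage
for f in features:
    print(f.shape)               # hierarchical features for the decoder
```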
### `load_vssm_pretrained_ckpt(model: nn.Module, ckpt_path: str = './pretrain/vmamba_tiny_e292.pth') -> nn.Module`

Load pretrained VMamba weights into a VSSMEncoder.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `Module` | Model containing a `vssm_encoder` attribute. | *required* |
| `ckpt_path` | `str` | Path to pretrained weights. | `'./pretrain/vmamba_tiny_e292.pth'` |

Returns:

| Type | Description |
|---|---|
| `Module` | Model with loaded weights. |
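A usage sketch under the documented contract: the function takes a model exposing a `vssm_encoder` attribute and returns the same model with VMamba weights loaded. `TinyWrapper` below is a hypothetical stand-in for the full RS3Mamba model, used only for illustration:

```python
import torch.nn as nn

from models.encoders.vssm_encoder import VSSMEncoder, load_vssm_pretrained_ckpt


class TinyWrapper(nn.Module):
    """Hypothetical wrapper exposing the `vssm_encoder` attribute the loader expects."""

    def __init__(self) -> None:
        super().__init__()
        self.vssm_encoder = VSSMEncoder()


model = load_vssm_pretrained_ckpt(
    TinyWrapper(), ckpt_path="./pretrain/vmamba_tiny_e292.pth"
)
```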