Releases: innat/VideoSwin
v2.0
Summary
Keras 3 implementation of Video Swin Transformer. The official PyTorch weights have been converted to be Keras 3 compatible. This implementation supports running the model on multiple backends, i.e. TensorFlow, PyTorch, and JAX.
Full Changelog: v1.1...v2.0
v1.1
v1.0
Checkpoints of VideoSwin in Keras
Checkpoints of VideoSwin, the Video Swin Transformer model in Keras. The pretrained weights are ported from the official PyTorch model. The following is the list of all available models in .h5 format.
Checkpoint Naming Style
For variation and brevity, the general format is:

```python
dataset = 'K400'              # 'K400', 'SSV2'
pretrained_dataset = 'IN1K'   # 'IN1K', 'IN22K'
size = 'B'                    # 'T', 'S', 'B'
patch_size = (2, 4, 4)
window_size = (8, 7, 7)       # (8, 7, 7), (16, 7, 7)
num_frames = 32
input_size = 224

>>> checkpoint_name = (
...     f'TFVideoSwin{size}_'
...     f'{dataset}_'
...     f'{pretrained_dataset}_'
...     f'P{"".join(map(str, patch_size))}_'
...     f'W{"".join(map(str, window_size))}_'
...     f'{num_frames}x{input_size}.h5'
... )
>>> checkpoint_name
'TFVideoSwinB_K400_IN1K_P244_W877_32x224.h5'
```
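The naming scheme can also be read in reverse. The helper below is an illustrative sketch (it is not part of this repo) that splits a checkpoint file name back into its components; the function and variable names are assumptions for demonstration only.

```python
import re

# Illustrative helper (not part of this repo): split a checkpoint file name
# back into its components, following the naming scheme described above.
_NAME = re.compile(
    r'TFVideoSwin(?P<size>[TSB])_'
    r'(?P<dataset>[A-Z0-9]+)_'
    r'(?P<pretrained_dataset>[A-Z0-9]+)_'
    r'P(?P<patch>\d+)_'
    r'W(?P<window>\d+)_'
    r'(?P<num_frames>\d+)x(?P<input_size>\d+)\.h5'
)

def _triplet(digits):
    # The last two axes are single digits (e.g. '877' -> (8, 7, 7)),
    # while the leading temporal value may have two ('1677' -> (16, 7, 7)).
    return int(digits[:-2]), int(digits[-2]), int(digits[-1])

def parse_checkpoint_name(name):
    m = _NAME.fullmatch(name)
    if m is None:
        raise ValueError(f'unrecognized checkpoint name: {name!r}')
    info = m.groupdict()
    info['patch'] = _triplet(info['patch'])
    info['window'] = _triplet(info['window'])
    info['num_frames'] = int(info['num_frames'])
    info['input_size'] = int(info['input_size'])
    return info

print(parse_checkpoint_name('TFVideoSwinB_SSV2_K400_P244_W1677_32x224.h5'))
```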
Here, size represents tiny (T), small (S), and base (B). The pretrained_dataset refers to the pretrained weights used to initialize the video swin model during training. For example, IN22K, i.e. ImageNet-22K pretrained 2D swin image models, are used to initialize the 3D video swin model. The dataset refers to the benchmark dataset, i.e. Kinetics or Something-Something-V2. The patch_size and window_size refer to internal parameters of the model architecture. The num_frames and input_size for video swin are 32 and 224 respectively. In the Keras implementation, the checkpoints are available in both SavedModel and h5 formats. Check the release page of v1.1 for the SavedModel checkpoints.
| Model Name |
|---|
| TFVideoSwinT_K400_IN1K_P244_W877_32x224.h5 |
| TFVideoSwinS_K400_IN1K_P244_W877_32x224.h5 |
| TFVideoSwinB_SSV2_K400_P244_W1677_32x224.h5 |
| TFVideoSwinB_K600_IN22K_P244_W877_32x224.h5 |
| TFVideoSwinB_K400_IN22K_P244_W877_32x224.h5 |
| TFVideoSwinB_K400_IN1K_P244_W877_32x224.h5 |
Here, IN1K and IN22K refer to ImageNet-1K and ImageNet-22K. P244 refers to a patch_size of [2,4,4] and W877 refers to a window_size of [8,7,7]. All these models output logits, which makes it easy to add a custom head on top for further downstream tasks. Check the notebook.