A graph convolutional network for skeleton-based action recognition. The paper proposes a novel model of dynamic skeletons called Spatial Temporal Graph Convolutional Networks (ST-GCN), which moves beyond the limitations of previous methods by automatically learning both the spatial and temporal patterns from data. The proposed approach also has greater generalization capability than existing approaches.
Please refer to the link below for the paper's code repository, which holds the codebase, dataset and models: Github
While maintaining the same performance and flexibility in both training and inference, in this repository we change the original model implementation from 5D input to 4D input, and replace the Einsum operator with an equivalent Matmul operation, enabling more general usage of the model on non-GPU/CPU devices. For example, on Huawei's Ascend AI processor, as of the date of this repository, the CANN software stack does not yet support 5D inputs or Einsum. The same may apply to other ASIC/FPGA AI processors. Details are provided in the sections below.
Pretrained model on Kinetics dataset: drive
Pretrained model on NTU RGB+D dataset (crossview): drive
Convert the PyTorch model to ONNX with the script pytorch2onnx.py (a rough export sketch is shown below)
ONNX Model for Kinetics dataset (batch=1): drive
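For reference, a rough sketch of the kind of export pytorch2onnx.py performs; the stand-in module, input shape (batch=1, M*V = 2*18 = 36, C = 3, T = 300 for kinetics-skeleton) and output file name are illustrative assumptions, not the script's exact contents:

```python
import torch
import torch.nn as nn

# Stand-in for the modified 4D-input ST-GCN; the real script builds the model
# and loads the pretrained weights instead.
class DummySTGCN(nn.Module):
    def forward(self, x):              # x: N, M*V, C, T
        return x.mean(dim=(1, 2, 3))   # placeholder for the class logits

model = DummySTGCN().eval()
dummy_input = torch.randn(1, 36, 3, 300)  # batch=1, M*V=36, C=3, T=300
torch.onnx.export(model, dummy_input, "st_gcn_kinetics.onnx",
                  input_names=["data"], output_names=["logits"],
                  opset_version=11)
```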
- Python3
- pip install onnx for PyTorch to ONNX conversion
- Create a conda environment to install dependencies
- Install all other dependencies with pip install -r requirements.txt
- After cloning the repo, run the following:
cd torchlight; python setup.py install; cd ..
The original paper is evaluated on two skeleton datasets generated from Kinetics and NTU RGB+D. The scripts to generate the skeletons are provided in the paper's repo and forked here in tools/. Alternatively, the original paper also provides processed skeleton data that can be downloaded directly. Details are given below.
Kinetics dataset contains 400 human action classes, with at least 400 video clips for each action. Each clip lasts around 10s. Skeletons from each frame are extracted using Openpose and the resultant dataset is called kinetics-skeleton.
The dataset (7.5GB) can be downloaded from this Google Drive link as provided in the original git repository.
After uncompressing, the database can be built by the script provided by the original repo:
python tools/kinetics_gendata.py --data_path <path to kinetics-skeleton>
NTU RGB+D contains 60 action classes and 56,880 video samples. The dataset contains RGB videos, depth map sequences, 3D skeletal data, and infrared (IR) videos for each sample.
As stated in the original paper, only the 3D skeletal data was used for the experiments. The dataset can be downloaded from their website.
The following script is provided by the original repo to build the database for training or evaluation:
python tools/ntu_gendata.py --data_path <path to nturgbd+d_skeletons>
As mentioned above, we used both datasets for model training, evaluation and inference. Both datasets have train and validation sets.
- Data from the data loader is loaded as 5-dimensional (5D) tensors of shape N, C, T, V, M, corresponding to batch size, channels, temporal sequence length (number of frames), number of joints/graph nodes, and number of persons.
- We merge the V and M dimensions (number of graph nodes and number of persons) to create a 4D tensor before passing it to the model for prediction. V = 18 and V = 25 for the kinetics-skeleton and NTU RGB+D datasets respectively; for both datasets, M = 2. The resultant tensor is of shape 'N, MV, C, T' (see the sketch below).
- The code changes also support dynamic N, C, T, V, M values. For instance, a dataset can contain a single person or multiple persons in an input frame, and the number of vertices/graph nodes in the spatial domain can vary depending on the type of pose estimation algorithm used. We have handled these scenarios and no further code changes are required for training and validation.
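A minimal sketch of this merge, assuming a tensor straight from the data loader (the exact permute/view calls and variable names are illustrative, not a verbatim copy of the train/test code):

```python
import torch

# Toy batch from the loader: N, C, T, V, M
# (batch, channels, frames, joints, persons); kinetics-skeleton uses V = 18, M = 2.
N, C, T, V, M = 8, 3, 300, 18, 2
data = torch.randn(N, C, T, V, M)

# Merge the persons (M) and joints (V) axes into a single axis.
data_4d = data.permute(0, 4, 3, 1, 2).contiguous()  # N, M, V, C, T
data_4d = data_4d.view(N, M * V, C, T)              # N, MV, C, T

print(data_4d.shape)  # torch.Size([8, 36, 3, 300])
```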
Before describing the replacement, let's first understand how einsum works:
- The working of einsum is similar to matrix multiplication under the hood. One advantage of einsum is that it handles multiplying matrices with different dimensions without explicitly stating transpose and permute operations.
- In the above example, the input matrices are of shape (2,3 -> i,k) and (5,3 -> j,k). The einsum operator multiplies across the 'k' dimension to create a resultant matrix of dimension (2,5 -> i,j).
- We used matrix multiplication as an alternative to Einsum for this operation on the AI processor.
- As seen above, we need to explicitly match the dimensions (transpose) before the matmul operation (i,k , k,j -> i,j) to create the resultant matrix; a small comparison is sketched below.
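A self-contained sketch of this equivalence, using the shapes from the example above:

```python
import torch

a = torch.randn(2, 3)  # shape (i, k)
b = torch.randn(5, 3)  # shape (j, k)

# einsum contracts the shared 'k' axis directly.
out_einsum = torch.einsum('ik,jk->ij', a, b)      # shape (2, 5)

# matmul needs the dimensions matched explicitly: (i, k) @ (k, j) -> (i, j).
out_matmul = torch.matmul(a, b.transpose(0, 1))   # shape (2, 5)

print(torch.allclose(out_einsum, out_matmul))     # True
```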
As stated above, data is loaded as a 4D tensor during training and validation to be compatible with the Ascend AI processor.
- Changes were made in the train and test functions. Data from the loader is viewed to merge the V and M dimensions before being passed to the model.
- The resultant tensor shape is 'N, MV, C, T'.
The Einsum replacement with matmul was implemented in tgcn.py:
Original st-gcn:
- The inputs to the einsum are two tensors, 'x' and 'A'. x is formulated as a 5D tensor of shape NM, K, C, T, V, where the batch size 'N' is merged with the number of persons 'M' as N*M.
- 'A' is an adjacency matrix of dimension K, V, V, where K is the spatial kernel size and V is the number of vertices, which depends on the dataset used. For a clearer naming convention, we rename K,V,V to K,V,W.
- The input dimensions to einsum are ('NM,K,C,T,V') and ('K,V,W'). It performs multiplication across the matched axes 'K' and 'V'. The resultant tensor dimension is 'NM,C,T,W', as sketched below.
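For reference, a simplified, runnable sketch of the einsum-based aggregation in the original tgcn.py (not a verbatim copy; the example sizes are for kinetics-skeleton):

```python
import torch

def original_graph_conv(x, A, kernel_size):
    """Einsum-based aggregation, as in the original ConvTemporalGraphical."""
    n, kc, t, v = x.size()
    x = x.view(n, kernel_size, kc // kernel_size, t, v)  # NM, K, C, T, V
    return torch.einsum('nkctv,kvw->nctw', (x, A))       # NM, C, T, W

NM, K, C, T, V = 4, 3, 64, 300, 18
x = torch.randn(NM, K * C, T, V)  # output of the preceding 1x1 convolution
A = torch.randn(K, V, V)          # adjacency tensor (K, V, W with W = V)
print(original_graph_conv(x, A, K).shape)  # torch.Size([4, 64, 300, 18])
```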
Modified with matmul:
- Our modification uses x as a 4D tensor of dimension NM, C, T, K*V.
- The adjacency matrix 'A' is used as a 2D tensor before the matmul operation, with dimensions K*V, W.
- The input dimensions to the matmul operation are ('NM, C, T, KV') and ('KV, W'). For matmul, the last dimension of x must match the first dimension of A, i.e. K*V.
- The result of the matmul is a 4D tensor of dimension 'NM,C,T,W'. We verified that the output dimensions and matrix values match the einsum operation (see the sketch below).
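A sketch of the matmul-based replacement, with a check that it matches the einsum path above (the function and variable names are illustrative):

```python
import torch

def matmul_graph_conv(x, A, kernel_size):
    """Matmul-based aggregation equivalent to einsum('nkctv,kvw->nctw')."""
    n, kc, t, v = x.size()
    k, c = kernel_size, kc // kernel_size
    x = x.view(n, k, c, t, v)                                        # NM, K, C, T, V
    x = x.permute(0, 2, 3, 1, 4).contiguous().view(n, c, t, k * v)   # NM, C, T, K*V
    A_flat = A.reshape(k * v, -1)                                    # K*V, W
    return torch.matmul(x, A_flat)                                   # NM, C, T, W

NM, K, C, T, V = 4, 3, 64, 300, 18
x = torch.randn(NM, K * C, T, V)
A = torch.randn(K, V, V)

out_matmul = matmul_graph_conv(x, A, K)
out_einsum = torch.einsum('nkctv,kvw->nctw', x.view(NM, K, -1, T, V), A)
print(out_matmul.shape)                                    # torch.Size([4, 64, 300, 18])
print(torch.allclose(out_einsum, out_matmul, atol=1e-4))   # True (up to float tolerance)
```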
We trained 1. the original st-gcn model and 2. the model with the code changes stated above, both from scratch, to show that the modified model can achieve the same performance.
The following table shows the top-1 accuracy achieved on both datasets:
Dataset / Model | Original st-gcn | Modified st-gcn |
---|---|---|
Kinetics-skeleton (%) | 31.05 | 31.05 |
NTU RGB+D (Cross View) (%) | 79.57 | 79.39 |
During training and validation, we logged the train loss and accuracy metrics using TensorBoard.
The following are graph comparisons between the original st-gcn and the modified st-gcn:
Blue curve indicates original st-gcn; Pink curve indicates st-gcn with code changes.
Orange curve indicates original st-gcn; Red curve indicates st-gcn with code changes.
For the demo on test videos, we configured OpenPose and executed the following:
python main.py demo_offline [--video ${PATH_TO_VIDEO}] [--openpose ${PATH_TO_OPENPOSE}]
Arguments:
- PATH_TO_VIDEO: path to the test video
- PATH_TO_OPENPOSE: path to the OpenPose build folder
Note: We did not perform the demo experiments for real-time pose estimation and prediction.