Dataset: MIRACL-VC1 https://sites.google.com/site/achrafbenhamadou/-datasets/miracl-vc1?pli=1
""Fifteen speakers (five men and ten women) positioned in the frustum of a MS Kinect sensor and utter ten times a set of ten words and ten phrases (see the table below). Each instance of the dataset consists of a synchronized sequence of color and depth images (both of 640x480 pixels). The MIRACL-VC1 dataset contains a total number of 3000 instances.""
We have limited the scope of the project to predicting only the words.
The main code cells are in the files ./data_generator.ipynb and ./architectures/3d_cnn.ipynb
data_generator.ipynb : Crops the lip region out of each face image and stores the crops in the same folder structure as the original dataset.
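A rough, hedged sketch of that cropping step (not the notebook's actual code): `crop_lips` here uses a fixed lower-central region as a stand-in for a real mouth detector such as dlib facial landmarks, and `mirror_path` is a hypothetical helper showing how the output keeps the original folder layout.

```python
import numpy as np
from pathlib import Path


def crop_lips(face_img: np.ndarray) -> np.ndarray:
    """Placeholder lip crop: take the lower-central region of the face image.
    A real pipeline would locate the mouth with facial landmarks instead."""
    h, w = face_img.shape[:2]
    return face_img[int(0.6 * h):h, int(0.25 * w):int(0.75 * w)]


def mirror_path(src_root: Path, dst_root: Path, src_file: Path) -> Path:
    """Map a file under src_root to the same relative location under dst_root,
    so the cropped dataset mirrors MIRACL-VC1's original folder structure."""
    return dst_root / src_file.relative_to(src_root)
```

For example, a frame at `data/F01/words/01/01/frame_000.png` (hypothetical path) would be written to `cropped/F01/words/01/01/frame_000.png`.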
training_model.ipynb : Defines and trains the 3D CNN model on the cropped lip sequences.
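The key idea of a 3D CNN is that each filter convolves over time as well as space, so it sees a short stack of consecutive lip frames at once. A minimal numpy illustration of a single-channel, valid-padding 3D convolution (hypothetical shapes, purely to show the operation, not the notebook's actual architecture):

```python
import numpy as np


def conv3d_valid(volume: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Single-channel 3D convolution (cross-correlation) with 'valid' padding.
    volume: (T, H, W) stack of grayscale lip frames; kernel: (kt, kh, kw)."""
    T, H, W = volume.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                # Each output value summarizes a (kt, kh, kw) spatio-temporal patch.
                out[t, i, j] = np.sum(volume[t:t + kt, i:i + kh, j:j + kw] * kernel)
    return out
```

A framework layer such as Keras's `Conv3D` does the same thing with multiple learned filters and channels; stacking a few such layers followed by pooling and a dense softmax over the ten word classes gives a simple architecture of this kind.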
Results:
At epoch 45, the final epoch, the model reached a validation accuracy of 0.5850. This is expected for a simple 3D CNN, which, unlike RNNs, has no mechanism for retaining long-range temporal context.