Manoj Sharma, Omid Moridnejad, Raquel Colares, Thuvaarakkesh Ramanathan
Physiotherapists must not only assess patients and carry out procedures, but also document every detail of each evaluation. This documentation takes time during the consultation, adds to their workload, and leaves them less time to focus on the patient.
To address this healthcare need, the goal of this project is to create an interface where physiotherapists can attach an audio recording and receive its transcription automatically, so they do not have to write or type during the patient's appointment.
This project encompasses a speech-to-text transcription task built on deep unsupervised learning models.
The documentation workload for physiotherapists takes away valuable time from patient care. Our goal is to develop a speech-to-text transcription system using deep unsupervised learning models to automate this process, improving efficiency and reducing administrative burden.
The development of this project relied on doctor-patient conversation audios from the study "PriMock57: A Dataset Of Primary Care Mock Consultations". This dataset served as the primary resource for training the speech-to-text model within a healthcare context. While it focuses on general medical dialogues rather than physiotherapy-specific interactions, it still provided a valuable foundation for building and testing the transcription pipeline. At the time of development, no publicly available datasets containing specifically physiotherapy consultation audio were identified.
However, for testing purposes, we recorded our own simulated physiotherapy consultations in five languages (English, French, Portuguese, Persian, and Hindi) to evaluate both the performance of the models and the functionality of the Streamlit interface.
- Whisper model
For this project, we explored different transcription approaches, and OpenAI's Whisper model stood out as the most effective. Widely adopted in healthcare applications, Whisper has proven capable of handling real-world, noisy audio while maintaining high transcription accuracy. It performed well on medical dialogues and provides robust multilingual support across 99 languages, which matched our multilingual evaluation set. Its strong performance, language coverage, and relevance to clinical contexts made Whisper the ideal choice for reliable transcription in our project.
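A minimal sketch of how a consultation recording can be transcribed with the open-source `openai-whisper` package; the helper name `transcribe_audio` and the `"base"` model size are illustrative choices, not the project's exact configuration:

```python
def transcribe_audio(path: str, model_size: str = "base") -> str:
    """Transcribe a consultation audio file with OpenAI Whisper.

    `model_size` can be e.g. "tiny", "base", "small", "medium", or "large";
    larger models are slower but more accurate. Whisper detects the spoken
    language automatically, which matters for our five-language test set.
    """
    import whisper  # pip install openai-whisper

    model = whisper.load_model(model_size)
    result = model.transcribe(path)
    return result["text"]
```

Calling `transcribe_audio("consultation.wav")` returns the full transcription as a single string, which then feeds the summarization stage.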
- Autoencoder
To perform extractive summarization on the transcriptions generated by Whisper, we implemented a custom autoencoder using PyTorch. The autoencoder is a type of neural network trained to reconstruct its input through a bottleneck (latent space), learning compressed semantic representations of sentences.
The architecture consists of two fully connected layers:
Encoder:
The encoder compresses the 384-dimensional sentence embeddings (generated with a transformer model) into a 256-dimensional hidden representation.
Decoder:
The decoder reconstructs the original embedding from the hidden representation. Training uses a combination of Mean Squared Error (MSE) and Cosine Similarity loss to ensure both geometric and angular closeness.
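The encoder/decoder pair and the combined loss described above can be sketched in PyTorch as follows; the class name, the ReLU activation, and the equal weighting of the two loss terms are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SentenceAutoencoder(nn.Module):
    """Compress 384-dim sentence embeddings to a 256-dim latent space."""

    def __init__(self, embed_dim: int = 384, hidden_dim: int = 256):
        super().__init__()
        self.encoder = nn.Linear(embed_dim, hidden_dim)
        self.decoder = nn.Linear(hidden_dim, embed_dim)

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.encoder(x))   # latent representation
        return self.decoder(z), z         # reconstruction + latent


def reconstruction_loss(x, x_hat, alpha: float = 0.5):
    """Blend MSE (geometric closeness) and cosine loss (angular closeness)."""
    mse = F.mse_loss(x_hat, x)
    cosine = 1.0 - F.cosine_similarity(x_hat, x, dim=-1).mean()
    return alpha * mse + (1.0 - alpha) * cosine
```

Sentences whose embeddings reconstruct well (low loss) can then be treated as representative of the transcription, which supports the extractive-summarization step.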
- BERT
To complement our summarization pipeline, we implemented an extractive summarization method using a pre-trained BERT model (bert-base-uncased). BERT, known for its contextual language understanding, provides sentence-level embeddings that enable us to assess the semantic relevance of each sentence within a transcription.
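One common way to turn sentence embeddings into an extractive summary is to score each sentence by its cosine similarity to the document centroid and keep the top-scoring ones. The sketch below assumes this centroid strategy and uses mean-pooled `bert-base-uncased` embeddings; the helper names are illustrative, not the project's exact code:

```python
import numpy as np


def rank_sentences(embeddings: np.ndarray, top_k: int = 3) -> list:
    """Return indices of the top_k sentences closest to the centroid,
    in original document order."""
    centroid = embeddings.mean(axis=0)
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(centroid)
    scores = embeddings @ centroid / np.clip(norms, 1e-9, None)
    top = np.argsort(scores)[::-1][:top_k]
    return sorted(top.tolist())


def embed_with_bert(sentences):
    """Mean-pooled BERT sentence embeddings (requires `transformers`, `torch`)."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")
    enc = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state
    mask = enc["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()
```

The summary is then the selected sentences concatenated in their original order, which keeps the clinical narrative readable.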
- Project visualization: https://physio-interface.streamlit.app/
The Streamlit app can be accessed at the link above, or launched locally with the following command in the Anaconda prompt:
streamlit run physio-app.py
Physio-Interface-demo-compress.mp4
Korfiatis, A.P., Sarac, R., Moramarco, F., Savkov, A. (2022). PriMock57: A Dataset Of Primary Care Mock Consultations. Available at: https://arxiv.org/abs/2204.00333 (Accessed: February 2025)
Sequence Modeling With CTC. Available at: https://distill.pub/2017/ctc/ (Accessed: February 2025)
Platen, P.V. Fine-Tune Wav2Vec2 for English ASR with Transformers. Available at: https://huggingface.co/blog/fine-tune-wav2vec2-english (Accessed: February 2025)
OpenAI Platform. Speech to text. Available at: https://platform.openai.com/docs/guides/speech-to-text (Accessed: February 2025)
PyTorch. Cosine Similarity. Available at: https://pytorch.org/docs/stable/generated/torch.nn.CosineSimilarity.html (Accessed: April 2025)