
Add leading silence detection and removal logic #17648

Open · wants to merge 3 commits into master

Conversation

@gexgd0419 (Contributor) commented Jan 24, 2025

Link to issue number:

Closes #17614

Summary of the issue:

Some voices output a period of leading silence before the actual speech. Removing that silence shortens the delay between a keypress and the user hearing audio, making the voices more responsive.

Description of user facing changes

Users may find the voices more responsive. All voices using NVDA's WavePlayer will be affected, including eSpeak-NG, OneCore, SAPI5, and some third-party voice add-ons.

This should only affect leading silence. Silence between sentences or at punctuation marks is not changed, though this may depend on how the voice uses WavePlayer.

Description of development approach

I wrote a header-only library, silenceDetect.h, in nvdaHelper/local. It supports most wave formats (8/16/24/32-bit integer and 32/64-bit float) and uses a simple algorithm: check each sample to see whether it falls outside the threshold range (currently hard-coded to ±1/2^10, or 0.0009765625). It uses templates and requires the C++20 standard.
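For illustration, here is a minimal sketch of that kind of per-sample check, assuming the threshold is applied to the sample amplitude normalized to full scale; the names (isSilentSample, firstNonSilentSample) are made up, and the real silenceDetect.h also has to deal with 8-bit unsigned and packed 24-bit samples, which are omitted here.

```cpp
#include <cmath>
#include <concepts>
#include <cstddef>
#include <limits>

// Threshold of +/- 1/2^10 (0.0009765625) on the normalized amplitude.
constexpr double SILENCE_THRESHOLD = 1.0 / 1024.0;

// A sample counts as silent if its normalized amplitude stays within the threshold.
template <typename T>
	requires std::signed_integral<T> || std::floating_point<T>
bool isSilentSample(T sample) {
	double normalized;
	if constexpr (std::signed_integral<T>) {
		// Scale signed integer samples (e.g. int16_t, int32_t) into the -1.0..1.0 range.
		normalized = static_cast<double>(sample) / (static_cast<double>(std::numeric_limits<T>::max()) + 1.0);
	} else {
		// Float samples are already nominally within -1.0..1.0.
		normalized = static_cast<double>(sample);
	}
	return std::fabs(normalized) < SILENCE_THRESHOLD;
}

// Return the index of the first sample outside the threshold,
// or sampleCount if the whole buffer is silent.
template <typename T>
std::size_t firstNonSilentSample(const T* samples, std::size_t sampleCount) {
	for (std::size_t i = 0; i < sampleCount; ++i) {
		if (!isSilentSample(samples[i])) {
			return i;
		}
	}
	return sampleCount;
}
```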

The WasapiPlayer in wasapi.cpp is updated to handle silence. A new member function, startTrimmingLeadingSilence, and its exported counterpart, wasPlay_startTrimmingLeadingSilence, are added to set or clear the isTrimmingLeadingSilence flag. While isTrimmingLeadingSilence is true, the next chunk fed in has its leading silence removed; as soon as non-silence is detected, the flag is reset to false. So startTrimmingLeadingSilence should be called every time a new utterance is about to be spoken.
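A minimal, self-contained sketch of that flag handshake might look like the following; it assumes 16-bit mono audio and invented names (SketchPlayer, feed), whereas the real WasapiPlayer applies the check inside its WASAPI feed path for whatever wave format the player was opened with.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct SketchPlayer {
	bool isTrimmingLeadingSilence = false;

	// Counterpart of startTrimmingLeadingSilence / wasPlay_startTrimmingLeadingSilence:
	// arm (or disarm) trimming for the audio that will be fed in next.
	void startTrimmingLeadingSilence(bool start) {
		isTrimmingLeadingSilence = start;
	}

	// Counterpart of the feed path: while trimming is armed, drop leading samples that
	// fall inside the +/- 1/2^10 threshold; once louder audio appears, disarm trimming
	// so pauses later in the utterance are left untouched.
	std::vector<int16_t> feed(const std::vector<int16_t>& chunk) {
		std::size_t start = 0;
		if (isTrimmingLeadingSilence) {
			constexpr int16_t threshold = 32;  // 32768 / 1024, i.e. 1/2^10 of full scale
			while (start < chunk.size() && chunk[start] >= -threshold && chunk[start] <= threshold) {
				++start;
			}
			if (start < chunk.size()) {
				isTrimmingLeadingSilence = false;  // non-silence found in this chunk
			}
		}
		return std::vector<int16_t>(
			chunk.begin() + static_cast<std::ptrdiff_t>(start), chunk.end());
	}
};
```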

In nvwave.py, wasPlay_startTrimmingLeadingSilence(self._player, True) will be called when:

  • the player is initialized;
  • the player is stopped;
  • idle is called;
  • _idleCheck determines that the player is idle.

Voices usually call idle when an utterance is completed so that audio ducking can work correctly, so idle is used here to mark the starting point of the next utterance. If a voice doesn't use idle this way, this logic may misbehave.

As long as the synthesizer uses idle as intended, the synthesizer's code doesn't need to be modified to benefit from this feature.

Other approaches that may be worth considering (but have not been implemented):

  • Put the silence detection/removal logic in a separate module instead of in WavePlayer. The drawback is that every voice synthesizer driver would need to be modified to use that module.
  • Use another audio library, such as PyDub, to detect/remove silence.
  • Add a setting item to turn this feature on or off.
  • Add a public API function to allow a synthesizer to opt out of this feature.

Testing strategy:

These are the delay results I got using 1X speed. "Improved delay" means the delay after applying this PR.

| Voice | Original delay | Improved delay |
| --- | --- | --- |
| eSpeak NG | 89 ms | 69 ms |
| Huihui OneCore | 192 ms | 74 ms |
| Huihui SAPI5 | 166 ms | 57 ms |

If the speech rate is higher than 1X, the original delay may be shorter, because the leading silence is also shortened during the speed-up process. When there's no leading silence, changing the rate does not affect the delay much.

Considering the margin of error, we can say that the delays of eSpeak NG, OneCore, and SAPI5 voices are now at roughly the same level.

Known issues with pull request:

Code Review Checklist:

  • Documentation:
    • Change log entry
    • User Documentation
    • Developer / Technical Documentation
    • Context sensitive help for GUI changes
  • Testing:
    • Unit tests
    • System (end to end) tests
    • Manual testing
  • UX of all users considered:
    • Speech
    • Braille
    • Low Vision
    • Different web browsers
    • Localization in other languages / culture than English
  • API is compatible with existing add-ons.
  • Security precautions taken.


@zstanecic (Contributor):

@gexgd0419
I am slightly confused... does your change require modifications to synth drivers?

@gexgd0419 (Contributor, Author):

does your change require modifications to synth drivers?

In the current implementation, no. As long as the synth driver calls idle after each utterance, it will work, because the logic is integrated into WavePlayer.

But if separating the logic out of WavePlayer is preferred, then the synth driver will have to be aware of the separate module.

@cary-rowen (Contributor):

I like this very much; I just ran a simple test without any metrics, though.
To put it simply, with the Natural voice adapter and SAPI5, NVDA previously gave no voice feedback when I pressed keys quickly. Now this has improved significantly.

@zstanecic (Contributor):

@gexgd0419
@cary-rowen
I am not so patient... Yes, I wanted to test it from source, and I did, so here are my findings:
Compared to the current alpha and stable versions of NVDA, I see a drastic improvement when using various speech synths.
So, to conclude, it is very nice work!
It works with any TTS, even with RHVoice in the NVDA add-on which I am using now.
There was a time when typing echo slowed me down a bit. Now I can keep typing echo turned on and listen to speech while typing.

@gexgd0419 marked this pull request as ready for review January 24, 2025 15:15
@gexgd0419 requested a review from a team as a code owner January 24, 2025 15:15
@rmcpantoja (Contributor):

Hi, I would like to look at this from a critical point of view, having previous experience with TTS:
Some of the minor delay in these voices doesn't come from the voices themselves; it comes from preprocessing steps such as text pre-processing, a text-to-phoneme backend, language analyzers, or other tools applied before speech is synthesized from the provided text. Fortunately, these steps only take a few milliseconds, so that wouldn't be a problem.
Also, most voices are built from already pre-processed audio, whether we are talking about unit selection or Hidden Markov Model voices, in the free or commercial field. That means silences are technically trimmed before these voices are built from their databases. However, one particular case of voices with small leading silences is the traditional Microsoft Hidden Markov Model voices, which are no longer supported. For those voices, your changes are useful and can improve performance during navigation.
But anyway, let me try it over the next few days, and I may retract my comment.

Cheers.

@zstanecic (Contributor):

Hi @rmcpantoja
That statement in your comment is not quite true. RHVoice is also Hidden Markov Model based.
We don't do anything with silences when building the voices; we need them to detect pauses in the utterances during the training process.
Also, we sometimes use Google's VAD library to detect silence.
Ivona TTS is a concatenative TTS, and Microsoft's natural TTS is concatenative as well.
When using such voices, I hear noticeable differences in speed and responsiveness.
To conclude, all voices benefit from this feature, and there is no impact on the resulting speech, which means there are no drawbacks.

Successfully merging this pull request may close these issues.

Improve the responsiveness of voices by trimming the leading silence