
The best audio format supported by FastRTC? #138

Open

limcheekin opened this issue Mar 7, 2025 · 4 comments

@limcheekin

From my understanding, Opus is the best audio format supported by WebRTC. May I know whether FastRTC supports it? If not, which audio format works best with FastRTC?

I'm using https://github.com/remsky/Kokoro-FastAPI?tab=readme-ov-file#features, which supports the following audio formats:

  • mp3
  • wav
  • opus
  • flac
  • m4a
  • pcm
@freddyaboulton
Owner

Do you get an error with opus, @limcheekin?

@limcheekin
Author

Thanks for the quick response, @freddyaboulton.

The following code block works when the value of settings.TTS_AUDIO_FORMAT is pcm:

        with tts_client.audio.speech.with_streaming_response.create(
            model=settings.TTS_MODEL,
            voice=settings.TTS_VOICE,
            input=text,
            response_format=settings.TTS_AUDIO_FORMAT,
            extra_body={"backend": settings.TTS_BACKEND, "language": settings.LANGUAGE},
        ) as stream_audio:
            # Iterate through all audio chunks in the stream
            for i, audio_chunk in enumerate(stream_audio.iter_bytes(chunk_size=1024)):
                print(f"Processing audio chunk {i}")
                audio_array = np.frombuffer(audio_chunk, dtype=np.int16).reshape(1, -1)
                yield (24000, audio_array)

But when the value of settings.TTS_AUDIO_FORMAT is opus, I changed the last two lines of the code above to the following:

        audio_array = np.frombuffer(audio_chunk, dtype=np.uint8).reshape(1, -1)
        yield (24000, audio_array)

(By the way, I don't know much about audio processing; the code above was suggested by ChatGPT.)

Then, the app raises the following error:

traceback %s Traceback (most recent call last):
  File "/media/limcheekin/My Passport/ws/py/talk-to-localai/.venv/lib/python3.12/site-packages/fastrtc/utils.py", line 178, in player_worker_decode
    frame = av.AudioFrame.from_ndarray(  # type: ignore
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "av/audio/frame.pyx", line 111, in av.audio.frame.AudioFrame.from_ndarray
  File "av/utils.pyx", line 64, in av.utils.check_ndarray
ValueError: Expected numpy array with dtype `float32` but got `uint8`

Error processing frame: %s Expected numpy array with dtype `float32` but got `uint8`

Please advise. Thank you.
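For context: opus is a compressed codec, so its bytes cannot simply be reinterpreted as raw samples with np.frombuffer; they would need to be decoded to PCM first. Below is a minimal sketch of that decode step, assuming the whole Ogg Opus payload is buffered before decoding and using PyAV (the av package visible in the traceback above); decode_opus_to_pcm is a hypothetical helper, not part of FastRTC or Kokoro-FastAPI.

    import io

    import av

    def decode_opus_to_pcm(opus_bytes: bytes):
        """Decode a complete Ogg Opus payload into (sample_rate, samples) tuples."""
        container = av.open(io.BytesIO(opus_bytes))
        for frame in container.decode(audio=0):
            # Opus typically decodes to planar float32 with shape (channels, n_samples)
            yield frame.sample_rate, frame.to_ndarray()

    # Hypothetical usage: buffer the streamed response, then decode it
    # opus_bytes = b"".join(stream_audio.iter_bytes())
    # yield from decode_opus_to_pcm(opus_bytes)

For a real-time pipeline, though, the simplest path is usually to request pcm from Kokoro-FastAPI and skip the decode step entirely.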

@freddyaboulton
Owner

Please use int16 or float32, not uint8! Let me know if that works.

@limcheekin
Author

@freddyaboulton Unfortunately, neither option is working. I can only hear a faint, incomprehensible sound like "seh... seh...".

Using int16 doesn't produce an error, but float32 raises: buffer size must be a multiple of element size.
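For what it's worth, numpy raises "buffer size must be a multiple of element size" when a chunk's byte length is not divisible by the dtype's item size (4 bytes for float32), and the faint, incomprehensible sound is most likely what the compressed opus data sounds like when it is played back as if it were raw PCM. With the pcm response format, the partial-chunk problem can be avoided by carrying leftover bytes over to the next chunk. A minimal sketch, assuming the same 24000 Hz rate as the snippet above; pcm_chunks_to_frames is a hypothetical helper:

    import numpy as np

    def pcm_chunks_to_frames(chunks, dtype=np.int16, sample_rate=24000):
        """Yield (sample_rate, samples) tuples from raw PCM byte chunks,
        carrying partial samples over to the next chunk."""
        itemsize = np.dtype(dtype).itemsize
        leftover = b""
        for chunk in chunks:
            data = leftover + chunk
            usable = len(data) - (len(data) % itemsize)  # keep whole samples only
            leftover = data[usable:]
            if usable:
                yield (sample_rate, np.frombuffer(data[:usable], dtype=dtype).reshape(1, -1))

    # Hypothetical usage inside the handler shown earlier:
    # yield from pcm_chunks_to_frames(stream_audio.iter_bytes(chunk_size=1024))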
