
Support AWS plugin for TTS, STT and LLM #1302

Open · wants to merge 51 commits into base: main
Conversation

Collaborator

@jayeshp19 jayeshp19 commented Dec 26, 2024

This PR implements an AWS plugin for TTS and STT.


changeset-bot bot commented Dec 26, 2024

🦋 Changeset detected

Latest commit: c2e1384

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package: livekit-plugins-aws (Minor)


return credentials.access_key, credentials.secret_key


TTS_SPEECH_ENGINE = Literal["standard", "neural", "long-form", "generative"]
@theomonnom (Member) commented Jan 13, 2025

We should move this to another file. Check how we do it for the other TTS/STT plugins.
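A minimal sketch of what that separate file could look like, mirroring how other plugins in the repo keep their `Literal` aliases out of `tts.py`/`stt.py`. The module name (`models.py`) and the `TTS_OUTPUT_FORMAT` values are assumptions, not the PR's actual code:

```python
# models.py -- hypothetical module layout for the plugin's type aliases.
from typing import Literal

TTS_SPEECH_ENGINE = Literal["standard", "neural", "long-form", "generative"]
TTS_OUTPUT_FORMAT = Literal["pcm", "mp3"]
```

The `tts.py` module would then import these aliases instead of defining them inline.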


response = await client.synthesize_speech(**_strip_nones(params))

if "AudioStream" in response:
Member

nit: avoid the extra indent here
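A sketch of the nit: invert the check into a guard clause so the happy path stays at one indentation level. `extract_audio` and its error message are illustrative, not the PR's actual code:

```python
# Guard-clause style: fail fast on the missing key, then handle the
# normal case without an extra level of nesting.
def extract_audio(response: dict) -> bytes:
    if "AudioStream" not in response:
        raise ValueError("synthesize_speech returned no AudioStream")
    # happy path, un-indented
    return response["AudioStream"]
```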

except Exception as e:
logger.exception(f"an error occurred while streaming inputs: {e}")

handler = TranscriptEventHandler(stream.output_stream, self._event_ch)
Member

Why do we create a separate class?

self,
*,
voice: str | None = "Ruth",
language: TTS_LANGUAGE | None = None,
Member

Suggested change
-    language: TTS_LANGUAGE | None = None,
+    language: TTS_LANGUAGE | str | None = None,

We should always allow a plain str here too; we can't guarantee we will update the language list quickly.

*,
voice: str | None = "Ruth",
language: TTS_LANGUAGE | None = None,
output_format: TTS_OUTPUT_FORMAT = "pcm",
Member

I don't think it makes sense to expose the output format; we only support pcm.

Collaborator (author)

we do support mp3

Member

ah true!

@jayeshp19 jayeshp19 changed the title [draft] Support AWS plugin for TTS and STT Support AWS plugin for TTS and STT Jan 20, 2025
@jayeshp19 jayeshp19 marked this pull request as ready for review January 20, 2025 09:51

# If API key and secret are provided, create a session with them
if api_key and api_secret:
session = boto3.Session(
Member

Is this making network calls?

Collaborator (author)

boto3.Session() doesn't make network calls, but session.get_credentials() does if the API key and secret aren't cached. We call it during initialization, though, so it's a one-time operation.
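The one-time-lookup point above can be sketched without boto3: cache the (possibly network-bound) fetch so it runs at most once per instance. `fetch` stands in for boto3's `session.get_credentials()`; the class and counter are illustrative only:

```python
import functools

class CredentialProvider:
    """Cache a possibly slow credential lookup so it happens once."""

    def __init__(self, fetch):
        self._fetch = fetch
        self.fetch_calls = 0  # for illustration: count actual lookups

    @functools.cached_property
    def credentials(self):
        # first access performs the fetch; later accesses hit the cache
        self.fetch_calls += 1
        return self._fetch()
```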

Comment on lines 154 to 173
try:
async for frame in self._input_ch:
if isinstance(frame, rtc.AudioFrame):
await stream.input_stream.send_audio_event(
audio_chunk=frame.data.tobytes()
)
await stream.input_stream.end_stream()

except Exception as e:
logger.exception(f"an error occurred while streaming inputs: {e}")

async def handle_transcript_events():
try:
async for event in stream.output_stream:
if isinstance(event, TranscriptEvent):
self._process_transcript_event(event)
except Exception as e:
logger.exception(
f"An error occurred while handling transcript events: {e}"
)
Member

Instead of using try/finally by hand: we have a utility for it here.

finally:
await utils.aio.gracefully_cancel(*tasks)
except Exception as e:
logger.exception(f"An error occurred while streaming inputs: {e}")
Member

I think this is swallowing exceptions? In that case the base class will not try to reconnect on failure.
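A hedged sketch of the fix implied by this comment: log the failure but re-raise it (optionally wrapped) so the supervising base class sees it and can attempt a reconnect. `APIConnectionError` and `run_input_stream` are illustrative names, not necessarily the framework's actual API:

```python
import logging

logger = logging.getLogger("example")

class APIConnectionError(Exception):
    """Illustrative error type a supervising base class might watch for."""

async def run_input_stream(stream_inputs):
    try:
        await stream_inputs()
    except Exception as e:
        logger.exception("an error occurred while streaming inputs")
        # re-raise instead of swallowing, so the caller can retry/reconnect
        raise APIConnectionError(str(e)) from e
```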

Comment on lines 108 to 115
def get_client(self):
"""Returns a client creator context."""
return self._session.create_client(
"polly",
region_name=self._opts.speech_region,
aws_access_key_id=self._api_key,
aws_secret_access_key=self._api_secret,
)
Member

Should we hide this?

from typing import Any, Callable

import aiohttp
from aiobotocore.session import AioSession, get_session # type: ignore
Member

They don't support types?

@jayeshp19 (Collaborator, author) commented Jan 27, 2025

They don't have official support for types. I found this thread (boto/boto3#2213) and this library: https://github.com/youtype. Should we include it?
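One hedged way to use those community stubs without making them a runtime dependency: import them only under `TYPE_CHECKING`, so only mypy ever needs the package installed. The stub package and module path (`types_aiobotocore_polly.client.PollyClient`) follow the youtype naming convention but would need verifying against PyPI:

```python
from __future__ import annotations  # keeps annotations as strings at runtime

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # dev-only extra, e.g. `pip install types-aiobotocore-polly`
    from types_aiobotocore_polly.client import PollyClient

def client_name(client: PollyClient) -> str:
    # the annotation is never evaluated at runtime, so the stub package
    # does not need to be installed in production
    return type(client).__name__
```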

@jayeshp19 jayeshp19 changed the title Support AWS plugin for TTS and STT Support AWS plugin for TTS, STT and LLM Jan 27, 2025
jenfic commented Feb 5, 2025

Getting the following error when calling a function:
ValidationException:http://internal.amazon.com/coral/com.amazon.bedrock/
"message":"The toolConfig field must be defined when using toolUse and toolResult content blocks."

Karan-Rajesh-Nair commented Feb 6, 2025

Hey @jayeshp19, I was going through the error and wanted to see if these solutions could help:

  1. Add imports for proper type support:
     from mypy_boto3_bedrock_runtime import BedrockRuntimeClient
     from boto3.session import Session

  2. Fix the client initialization to use proper typing:
     self._session: Session = Session(
         aws_access_key_id=self._api_key,
         aws_secret_access_key=self._api_secret,
         region_name=region,
     )
     self._client: BedrockRuntimeClient = self._session.client("bedrock-runtime")

  3. Update the type hint in LLMStream:
     client: BedrockRuntimeClient,

And lastly, instead of converse_stream, would we be able to use invoke_model_with_response_stream in the _run method?

vanics (Contributor) commented Feb 6, 2025

Good work so far. I am looking forward to this one.

sunilvb commented Feb 6, 2025

Just tested this PR. Overall, amazing speeds when using aws.LLM(), aws.TTS(), and aws.STT()!

Here are a couple of errors from amazon_transcribe:

2025-02-06 17:05:51,236 - ERROR livekit.plugins.aws - Error in handle_transcript_events
Traceback (most recent call last):
File "..//livekit/agents/utils/log.py", line 16, in async_fn_logs
return await fn(*args, **kwargs)
File "..//livekit/plugins/aws/stt.py", line 161, in handle_transcript_events
async for event in stream.output_stream:
File "..//amazon_transcribe/eventstream.py", line 666, in __aiter__
parsed_event = self._parser.parse(event)
File "..//amazon_transcribe/deserialize.py", line 161, in parse
raise self._parse_event_exception(raw_event)
amazon_transcribe.exceptions.BadRequestException: Your request timed out because no new audio was received for 15 seconds. {"pid": 71376, "job_id": "AJ_ctSPy6HPRp57"}
2025-02-06 17:05:51,239 - ERROR livekit.agents.pipeline - Error in _recognize_task
Traceback (most recent call last):
File "..//livekit/agents/utils/log.py", line 16, in async_fn_logs
return await fn(*args, **kwargs)
File "..//livekit/agents/pipeline/human_input.py", line 150, in _recognize_task
await asyncio.gather(*tasks)
File "..//livekit/agents/pipeline/human_input.py", line 136, in _stt_stream_co
async for ev in stt_stream:
File "..//livekit/agents/stt/stt.py", line 321, in __anext__
raise exc from None
File "..//livekit/agents/stt/stt.py", line 219, in _main_task
return await self._run()
File "..//livekit/plugins/aws/stt.py", line 170, in _run
await asyncio.gather(*tasks)
File "..//livekit/agents/utils/log.py", line 16, in async_fn_logs
return await fn(*args, **kwargs)
File "..//livekit/plugins/aws/stt.py", line 161, in handle_transcript_events
async for event in stream.output_stream:
File "..//amazon_transcribe/eventstream.py", line 666, in __aiter__
parsed_event = self._parser.parse(event)
File "..//amazon_transcribe/deserialize.py", line 161, in parse
raise self._parse_event_exception(raw_event)
amazon_transcribe.exceptions.BadRequestException: Your request timed out because no new audio was received for 15 seconds. {"pid": 71376, "job_id": "AJ_ctSPy6HPRp57"}
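A hedged sketch of one way to avoid the 15-second no-audio timeout reported above: when the input channel is idle, send a short silence frame as a keep-alive. `send_audio` stands in for `stream.input_stream.send_audio_event`; the frame size, interval, and sentinel convention are all illustrative:

```python
import asyncio

async def pump_audio(frames: asyncio.Queue, send_audio, *,
                     keepalive_interval: float = 10.0,
                     silence: bytes = b"\x00" * 320):
    """Forward audio chunks; emit silence when the input goes quiet."""
    while True:
        try:
            chunk = await asyncio.wait_for(frames.get(), timeout=keepalive_interval)
        except asyncio.TimeoutError:
            await send_audio(silence)  # no real audio: keep the stream open
            continue
        if chunk is None:  # sentinel marking end of input
            return
        await send_audio(chunk)
```

Whether Transcribe treats pure silence as "audio received" for timeout purposes would need confirming against the service's behavior.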

sunilvb commented Feb 6, 2025

Also, seeing the same error as @jenfic when using the function calls with aws.LLM():

ValidationException:http://internal.amazon.com/coral/com.amazon.bedrock/
"message":"The toolConfig field must be defined when using toolUse and toolResult content blocks."

Not sure if pytest could/should catch these errors.
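A hedged sketch of the likely fix for the ValidationException above: Bedrock's Converse API rejects `toolUse`/`toolResult` content blocks unless `toolConfig` is present, so the request builder should include `toolConfig` whenever tools are defined rather than omitting it unconditionally. The function name and tool-spec shapes here are illustrative, not the PR's actual code:

```python
def build_converse_kwargs(model_id, messages, tool_specs=None):
    """Assemble ConverseStream kwargs, adding toolConfig only when needed."""
    kwargs = {"modelId": model_id, "messages": messages}
    if tool_specs:
        # toolConfig must accompany any toolUse/toolResult content blocks
        kwargs["toolConfig"] = {"tools": [{"toolSpec": s} for s in tool_specs]}
    return kwargs
```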

sunilvb commented Feb 6, 2025

@jayeshp19 let me know if you need further help testing. I can help spin up all the services needed in AWS... (wink, wink).

meetakshay99 commented Feb 7, 2025

I tried using this AWS plugin for TTS but could not hear any audio.
I am using it with PipelineAgent and could see this printed to logs

2025-02-07 13:43:07,366 - DEBUG livekit.agents.pipeline - speech playout started {"speech_id": "112916673385", "pid": 358, "job_id": "AJ_xSYnj6Kpfpbk"}
2025-02-07 13:43:07,929 - DEBUG livekit.agents.pipeline - speech playout finished {"speech_id": "112916673385", "interrupted": false, "pid": 358, "job_id": "AJ_xSYnj6Kpfpbk"}
2025-02-07 13:43:07,930 - DEBUG livekit.agents.pipeline - committed agent speech {"agent_transcript": "today", "interrupted": false, "speech_id": "112916673385", "pid": 358, "job_id": "AJ_xSYnj6Kpfpbk"}

but did not hear any audio.

Below are the details sent to aws while creating the TTS object:
{'api_key': 'my_api_key', 'api_secret': 'my_secret_key', 'speech_region': 'eu-west-1', 'voice': 'Matthew', 'speech_engine': 'neural'}
What may I be doing wrong?
