[BIF-424] [BIF-428] [BIF-429] Adding new continuations concepts page #17

Merged · 13 commits · Oct 4, 2024
22 changes: 22 additions & 0 deletions fern/api-reference/stream-speech-websocket.mdx
@@ -132,3 +132,25 @@ If you don't know the last transcript in advance, you can send an input with an
You will only receive `done: true` after outputs for the entire context have been returned.

Outputs for a given context will always be in order of the inputs you streamed in. (That is, if you send input A and then input B on a context, you will first receive the chunks corresponding to input A, and then the chunks corresponding to input B.)

## Cancelling Requests
You can also cancel requests you've already submitted through the websocket.

To cancel a request, send a JSON message with the following structure:

```json WebSocket Request
{
  "context_id": "happy-monkeys-fly",
  "cancel": true
}
```

When you send a cancel request:

1. It only halts requests that have not yet begun generating a response.
2. Any request that is already generating will continue sending responses until completion.

<Note>
The `context_id` in the cancel request should match the `context_id` of the request you want to cancel.
</Note>
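
For illustration, here's a minimal Python sketch of sending this cancel message over a raw websocket connection using the `websockets` library. The endpoint URL and `api_key` query parameter are assumptions for the sketch rather than confirmed values; see the connection details earlier in this reference for the real ones.

```python
# Minimal sketch: cancelling a queued request over the raw websocket.
# The URL and auth query parameter below are illustrative placeholders.
import asyncio
import json
import os

import websockets  # pip install websockets

async def cancel_request(context_id: str) -> None:
    # Hypothetical endpoint; consult this API reference for the real URL.
    url = f"wss://api.cartesia.ai/tts/websocket?api_key={os.environ['CARTESIA_API_KEY']}"
    async with websockets.connect(url) as ws:
        # This message matches the documented cancel structure.
        await ws.send(json.dumps({"context_id": context_id, "cancel": True}))

asyncio.run(cancel_request("happy-monkeys-fly"))
```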

146 changes: 146 additions & 0 deletions fern/concepts/continuations.mdx
@@ -0,0 +1,146 @@
---
title: Continuations (Conditioning on Past Generations)
subtitle: Learn how to get the best quality out of multiple inputs
slug: concepts/continuations
---

## What are Continuations?

Continuations refer to the ability to extend a single audio generation across multiple sequential inputs. This is also known as **conditioning on past generations**.

This document covers the concepts; for implementation details, see [here](/reference/web-socket/stream-speech/working-with-web-sockets#input-streaming-with-contexts). Note that **continuations are only available through the [websocket endpoint](/reference/web-socket/stream-speech/stream-speech)**.

## Why should I use Continuations?

Given a large transcript or multiple transcripts, you could technically generate all sections independently and stitch them together to get the resulting audio. However, this is inefficient and results in unstable audio quality. Let's consider the simple example of `Hello, my name is Sonic. It's very nice to meet you.`

### Prosody

<figure>
<img src="/assets/images/concepts_continuations_without_continuations.png" alt="no_continuations" />
<figcaption>Figure 1: Generate transcripts independently & stitch them together.</figcaption>
</figure>

Let's split the example transcript into 3 parts. Transcript 1 will be `Hello, my name is Sonic.`, Transcript 2 will be `It's very nice`, and Transcript 3 will be `to meet you.`. We can generate each independently and combine them to get our final result:


<audio controls src="https://cartesia-docs-public.s3.us-east-2.amazonaws.com/concepts/continuations/without_continuations.wav">
Your browser does not support the audio element.
</audio>

Technically we've achieved TTS for the example transcript, but I think we can all agree this sounds incredibly weird. There are two main problems with this:
- The [prosody](https://en.wikipedia.org/wiki/Prosody_(linguistics)) is off between the clips, so it doesn't sound like a natural sentence. This is because `It's very nice` doesn't know that there's more to the sentence, and `to meet you.` doesn't know that something came before it.
- The breaks between audio 1, audio 2, and audio 3 are too short relative to a normal speaking cadence.

<figure>
<img src="/assets/images/concepts_continuations_with_continuations.png" alt="continuations" />
<figcaption>Figure 2: Generate transcripts using continuations.</figcaption>
</figure>

Let's try the same transcripts, but using continuations:

<AccordionGroup>
<Accordion
title="Example Python Code"
>
```python
import asyncio
from cartesia import AsyncCartesia
from datetime import datetime
import os
import sys
import wave

async def generate_audio_continuous(client, model_id, output_format, voice_id, transcripts):
    wave_file = wave.open(f'test_gen_continuations_{datetime.now().strftime("%Y%m%d_%H%M%S")}.wav', 'wb')
    wave_file.setnchannels(1)
    wave_file.setsampwidth(2)
    wave_file.setframerate(output_format['sample_rate'])

    # Connect a websocket
    ws = await client.tts.websocket()
    ctx = ws.context()

    for transcript in transcripts:
        await ctx.send(
            model_id=model_id,
            transcript=transcript,
            voice_id=voice_id,
            output_format=output_format,
            continue_=True,
        )

    # Indicate that no more inputs will be sent. Otherwise, the context will close after 5 seconds of inactivity.
    await ctx.no_more_inputs()

    async for output in ctx.receive():
        if "audio" in output:
            buffer = output["audio"]
            wave_file.writeframes(buffer)

    wave_file.close()
    await ws.close()

async def main():
    transcripts = ['Hello, my name is Sonic.', "It's very nice ", "to meet you."]
    voice_id = 'bd9120b6-7761-47a6-a446-77ca49132781'  # Tutorial Man

    model_id = "sonic-english"

    output_format = {
        "container": "raw",
        "encoding": "pcm_s16le",
        "sample_rate": 44100,
    }

    # Make sure your API key is set at CARTESIA_API_KEY
    async with AsyncCartesia(api_key=os.environ.get('CARTESIA_API_KEY'), timeout=200) as client:
        # Run the generation
        await generate_audio_continuous(client, model_id, output_format, voice_id, transcripts)

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
    loop.close()
    sys.exit(0)
```
</Accordion>
</AccordionGroup>

<audio controls src="https://cartesia-docs-public.s3.us-east-2.amazonaws.com/concepts/continuations/with_continuations.wav">
Your browser does not support the audio element.
</audio>

We can hear that the transcript flows much more naturally.

### Multiple inputs

You'll notice in Figure 2 that there's an `Audio N...` section in the output audio as well. This is because continuations allow you to chain any N inputs together into a coherent audio output.

### What if I want to stream in word by word?

![](/assets/images/concepts_continuations_multiple_inputs.png)

Let's try the following transcripts with continuations: `['Hello, my name is Sonic.', "It's ", 'very ', 'nice ', 'to ', 'meet ', 'you.']`

<Note>
Note that each one-word transcript ends with a trailing space unless it ends with punctuation.
</Note>
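
As a sketch of the change, only the transcripts list differs from the earlier Python example; `generate_audio_continuous` below refers to the helper defined in that example.

```python
# Word-by-word transcripts; each unpunctuated word carries a trailing space.
transcripts = [
    'Hello, my name is Sonic.',
    "It's ", 'very ', 'nice ', 'to ', 'meet ', 'you.',
]

# Inside the async main() from the earlier example:
await generate_audio_continuous(client, model_id, output_format, voice_id, transcripts)
```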

<audio controls src="https://cartesia-docs-public.s3.us-east-2.amazonaws.com/concepts/continuations/with_continuations_many_inputs.wav">
Your browser does not support the audio element.
</audio>

One of the more common use cases we've seen is streaming text from an LLM of your choice into our Text-To-Speech API. Our API buffers optimally on the server side, but because some users want to generate audio from short snippets, we only begin buffering after the first submission. This means **you need to buffer the first chunk on your end**.

<Note>
We recommend having the first chunk be a sentence, and then you can stream in token by token.
</Note>
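
Here's a minimal sketch of that first-chunk buffering pattern, assuming an async iterator of LLM tokens. `llm_token_stream` and the sentence-end heuristic are illustrative assumptions, and `ctx` is a websocket context like the one in the Python example above.

```python
async def stream_llm_to_tts(ctx, llm_token_stream, **send_kwargs):
    # Buffer tokens until the first full sentence, then stream token by token.
    buffer = ""
    first_sentence_sent = False
    async for token in llm_token_stream:  # e.g. tokens from your LLM's streaming API
        if first_sentence_sent:
            await ctx.send(transcript=token, continue_=True, **send_kwargs)
            continue
        buffer += token
        # Naive heuristic: treat ., !, or ? as the end of the first sentence.
        if buffer.rstrip().endswith((".", "!", "?")):
            await ctx.send(transcript=buffer, continue_=True, **send_kwargs)
            first_sentence_sent = True
    # Signal that no more inputs will be sent on this context.
    await ctx.no_more_inputs()
```

Here `send_kwargs` would carry the same `model_id`, `voice_id`, and `output_format` arguments shown earlier.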

We'll soon add a flag to begin buffering from the first chunk.

## Cancellations

Many of us know the feeling of kicking off a massive job and then realizing we've made a grave mistake. Luckily, we support cancellations. If you haven't yet signaled the end of transcript submissions (an empty transcript with `continue=False`), you can submit a cancellation request to prevent queued generations from running. This can be helpful for managing concurrency and character usage.

For implementation details, see [here](/reference/web-socket/stream-speech/working-with-web-sockets#cancelling-requests).
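
For reference, the end-of-input signal mentioned above looks roughly like the message below when sent over the raw websocket. The field names are an assumption based on the linked reference, and other required fields (model, voice, output format) are omitted for brevity.

```json
{
  "context_id": "happy-monkeys-fly",
  "transcript": "",
  "continue": false
}
```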
2 changes: 2 additions & 0 deletions fern/docs.yml
@@ -57,6 +57,8 @@ navigation:
path: getting-started/dev-quickstart.mdx
- section: Concepts
contents:
- page: Continuations (Conditioning on Past Generations)
path: concepts/continuations.mdx
- page: Embeddings and Voice Mixing
path: concepts/embeddings-and-voice-mixing.mdx
- section: User Guides
2 changes: 1 addition & 1 deletion fern/fern.config.json
@@ -1,4 +1,4 @@
{
  "organization": "cartesia",
  "version": "0.40.4"
-}
+}
2 changes: 1 addition & 1 deletion fern/user-guides/custom-pronunciation-guide.mdx
@@ -19,7 +19,7 @@ Our model follows the [English phonology article on Wikipedia](https://en.wikipe

You can copy/paste some of these uncommon symbols from the original [charts here](https://docs.google.com/spreadsheets/d/1OJbiKtxLyodpNPqVfOu43X2HloLsAixTtFppEuQ\_4pI/edit?usp=sharing).

-![](/images/sonic_ipa_guide.png)
+![](/assets/images/sonic_ipa_guide.png)

## Stresses and vowel length markers
