[BIF-424] [BIF-428] [BIF-429] Adding new continuations concepts page #17

Merged · 13 commits · Oct 4, 2024
22 changes: 22 additions & 0 deletions fern/api-reference/stream-speech-websocket.mdx
@@ -132,3 +132,25 @@ If you don't know the last transcript in advance, you can send an input with an
You will only receive `done: true` after outputs for the entire context have been returned.

Outputs for a given context will always be in order of the inputs you streamed in. (That is, if you send input A and then input B on a context, you will first receive the chunks corresponding to input A, and then the chunks corresponding to input B.)

## Cancelling Requests
You can also cancel requests you've already submitted through the websocket.

To cancel a request, send a JSON message with the following structure:

```json WebSocket Request
{
  "context_id": "happy-monkeys-fly",
  "cancel": true
}
```

When you send a cancel request:

1. It only halts requests that have not yet begun generating a response.
2. Any request that is already generating will continue sending responses until completion.

<Note>
The `context_id` in the cancel request should match the `context_id` of the request you want to cancel.
</Note>
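
For illustration, here's a minimal Python sketch of sending this cancel message over a raw websocket connection using the `websockets` library. The endpoint URL and `api_key` query parameter are assumptions for the sketch rather than confirmed values; see the connection details earlier in this reference for the real ones.

```python
# Minimal sketch: cancelling a queued request over the raw websocket.
# The URL and auth query parameter below are illustrative placeholders.
import asyncio
import json
import os

import websockets  # pip install websockets

async def cancel_request(context_id: str) -> None:
    # Hypothetical endpoint; consult this API reference for the real URL.
    url = f"wss://api.cartesia.ai/tts/websocket?api_key={os.environ['CARTESIA_API_KEY']}"
    async with websockets.connect(url) as ws:
        # This message matches the documented cancel structure.
        await ws.send(json.dumps({"context_id": context_id, "cancel": True}))

asyncio.run(cancel_request("happy-monkeys-fly"))
```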

146 changes: 146 additions & 0 deletions fern/concepts/continuations.mdx
@@ -0,0 +1,146 @@
---
title: Continuations (Conditioning on Past Generations)
subtitle: Learn how to get the best quality out of multiple inputs
slug: concepts/continuations
---

## What are Continuations?

Continuations refer to the ability to extend a single audio generation across multiple sequential inputs. This is also known as **conditioning on past generations**.

This document covers the concepts; for implementation details, see [here](/reference/web-socket/stream-speech/working-with-web-sockets#input-streaming-with-contexts). Note that **continuations are only available through the [websocket endpoint](/reference/web-socket/stream-speech/stream-speech)**.

## Why should I use Continuations?

Given a large transcript or multiple transcripts, you could technically generate all sections independently and stitch them together to get the resulting audio. However, this is inefficient and results in unstable audio quality. Let's consider the simple example of `Hello, my name is Sonic. It's very nice to meet you.`

### Prosody

<figure>
<img src="/assets/images/concepts_continuations_without_continuations.png" alt="no_continuations" />
<figcaption>Figure 1: Generate transcripts independently & stitch them together.</figcaption>
</figure>

Let's split the example transcript into 3 parts. Transcript 1 will be `Hello, my name is Sonic.`, Transcript 2 will be `It's very nice`, and Transcript 3 will be `to meet you.`. We can generate each independently and combine them to get our final result:


<audio controls src="https://cartesia-docs-public.s3.us-east-2.amazonaws.com/concepts/continuations/without_continuations.wav">
Your browser does not support the audio element.
</audio>

Technically we've achieved TTS for the example transcript, but I think we can all agree this sounds incredibly weird. There are two main problems with this:
- The [prosody](https://en.wikipedia.org/wiki/Prosody_(linguistics)) is off between the clips, so it doesn't sound like a natural sentence. This is because `It's very nice` doesn't know that there's more to the sentence, and `to meet you.` doesn't know that something came before it.
- The breaks between audio 1, audio 2, and audio 3 are too short relative to a normal speaking cadence.

<figure>
<img src="/assets/images/concepts_continuations_with_continuations.png" alt="continuations" />
<figcaption>Figure 2: Generate transcripts using continuations.</figcaption>
</figure>

Let's try the same transcripts, but using continuations:

<AccordionGroup>
<Accordion
title="Example Python Code"
>
```python
import asyncio
from cartesia import AsyncCartesia
from datetime import datetime
import os
import sys
import wave

async def generate_audio_continuous(client, model_id, output_format, voice_id, transcripts):
    wave_file = wave.open(f'test_gen_continuations_{datetime.now().strftime("%Y%m%d_%H%M%S")}.wav', 'wb')
    wave_file.setnchannels(1)
    wave_file.setsampwidth(2)
    wave_file.setframerate(output_format['sample_rate'])

    # Connect a websocket
    ws = await client.tts.websocket()
    ctx = ws.context()

    for transcript in transcripts:
        await ctx.send(
            model_id=model_id,
            transcript=transcript,
            voice_id=voice_id,
            output_format=output_format,
            continue_=True,
        )

    # Indicate that no more inputs will be sent. Otherwise, the context will close after 5 seconds of inactivity.
    await ctx.no_more_inputs()

    async for output in ctx.receive():
        if "audio" in output:
            buffer = output["audio"]
            wave_file.writeframes(buffer)

    wave_file.close()
    await ws.close()

async def main():
    transcripts = ['Hello, my name is Sonic.', "It's very nice ", "to meet you."]
    voice_id = 'bd9120b6-7761-47a6-a446-77ca49132781'  # Tutorial Man

    model_id = "sonic-english"

    output_format = {
        "container": "raw",
        "encoding": "pcm_s16le",
        "sample_rate": 44100,
    }

    # Make sure your API key is set at CARTESIA_API_KEY
    async with AsyncCartesia(api_key=os.environ.get('CARTESIA_API_KEY'), timeout=200) as client:
        # Run the generation
        await generate_audio_continuous(client, model_id, output_format, voice_id, transcripts)

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
    loop.close()
    sys.exit(0)
```
</Accordion>
</AccordionGroup>

<audio controls src="https://cartesia-docs-public.s3.us-east-2.amazonaws.com/concepts/continuations/with_continuations.wav">
Your browser does not support the audio element.
</audio>

We can hear that the transcript flows much more naturally.

### Multiple inputs

You'll notice in Figure 2 that there's an `Audio N...` section in the output audio as well. This is because continuations allow you to chain any N inputs together into a coherent audio output.

### What if I want to stream in word by word?

![](/assets/images/concepts_continuations_multiple_inputs.png)

Let's try the following transcripts with continuations: `['Hello, my name is Sonic.', "It's ", 'very ', 'nice ', 'to ', 'meet ', 'you.']`

<Note>
Note that each one-word transcript ends with a trailing space unless it ends with punctuation.
</Note>
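
As a sketch of the change, only the transcripts list differs from the earlier Python example; `generate_audio_continuous` below refers to the helper defined in that example.

```python
# Word-by-word transcripts; each unpunctuated word carries a trailing space.
transcripts = [
    'Hello, my name is Sonic.',
    "It's ", 'very ', 'nice ', 'to ', 'meet ', 'you.',
]

# Inside the async main() from the earlier example:
await generate_audio_continuous(client, model_id, output_format, voice_id, transcripts)
```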

<audio controls src="https://cartesia-docs-public.s3.us-east-2.amazonaws.com/concepts/continuations/with_continuations_many_inputs.wav">
Your browser does not support the audio element.
</audio>

One of the more common use cases we've seen is streaming text from an LLM of your choice into our Text-To-Speech API. Our API buffers optimally on the server side, but because some users want to generate audio from short snippets, we only begin buffering after the first submission. This means **you need to buffer the first chunk on your end**.

<Note>
We recommend having the first chunk be a sentence, and then you can stream in token by token.
</Note>
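
Here's a minimal sketch of that first-chunk buffering pattern, assuming an async iterator of LLM tokens. `llm_token_stream` and the sentence-end heuristic are illustrative assumptions, and `ctx` is a websocket context like the one in the Python example above.

```python
async def stream_llm_to_tts(ctx, llm_token_stream, **send_kwargs):
    # Buffer tokens until the first full sentence, then stream token by token.
    buffer = ""
    first_sentence_sent = False
    async for token in llm_token_stream:  # e.g. tokens from your LLM's streaming API
        if first_sentence_sent:
            await ctx.send(transcript=token, continue_=True, **send_kwargs)
            continue
        buffer += token
        # Naive heuristic: treat ., !, or ? as the end of the first sentence.
        if buffer.rstrip().endswith((".", "!", "?")):
            await ctx.send(transcript=buffer, continue_=True, **send_kwargs)
            first_sentence_sent = True
    # Signal that no more inputs will be sent on this context.
    await ctx.no_more_inputs()
```

Here `send_kwargs` would carry the same `model_id`, `voice_id`, and `output_format` arguments shown earlier.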

We'll soon add a flag to begin buffering from the first chunk.

## Cancellations

Many of us know the feeling of kicking off a massive job and then realizing we've made a grave mistake. Luckily, we support cancellations. If you haven't yet signaled the end of transcript submissions (an empty transcript with `continue=False`), you can submit a cancellation request to prevent queued generations from running. This can be helpful for managing concurrency and character usage.

For implementation details, see [here](/reference/web-socket/stream-speech/working-with-web-sockets#cancelling-requests).
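
For reference, the end-of-input signal mentioned above looks roughly like the message below when sent over the raw websocket. The field names are an assumption based on the linked reference, and other required fields (model, voice, output format) are omitted for brevity.

```json
{
  "context_id": "happy-monkeys-fly",
  "transcript": "",
  "continue": false
}
```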
2 changes: 2 additions & 0 deletions fern/docs.yml
@@ -57,6 +57,8 @@ navigation:
path: getting-started/dev-quickstart.mdx
- section: Concepts
contents:
- page: Continuations (Conditioning on Past Generations)
path: concepts/continuations.mdx
- page: Embeddings and Voice Mixing
path: concepts/embeddings-and-voice-mixing.mdx
- section: User Guides
2 changes: 1 addition & 1 deletion fern/fern.config.json
@@ -1,4 +1,4 @@
{
  "organization": "cartesia",
  "version": "0.40.4"
-}
+}
2 changes: 1 addition & 1 deletion fern/user-guides/custom-pronunciation-guide.mdx
@@ -19,7 +19,7 @@ Our model follows the [English phonology article on Wikipedia](https://en.wikipe

You can copy/paste some of these uncommon symbols from the original [charts here](https://docs.google.com/spreadsheets/d/1OJbiKtxLyodpNPqVfOu43X2HloLsAixTtFppEuQ\_4pI/edit?usp=sharing).

-![](/images/sonic_ipa_guide.png)
+![](/assets/images/sonic_ipa_guide.png)

## Stresses and vowel length markers
