
implement endpoint: stream_synthesis #1542

Open · wants to merge 2 commits into base: master
Conversation

@Yosshi999 (Contributor) commented Mar 1, 2025

Contents

Implements an endpoint that performs speech synthesis in fixed-length (1-second) units and returns the response as a stream.

Whereas the existing endpoint (/synthesis) returns its response only after the entire audio has been generated (figure below),
[image]

this implementation returns a response as soon as the first second is ready (figure below), so the first byte arrives much earlier.
[image]
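The chunking idea can be sketched in plain Python. This is a minimal illustration, not the PR's actual code: `synthesize_window` is a hypothetical stand-in for the engine, and the 24000 Hz / 16-bit PCM figures match the client sample later in this thread.

```python
# Sketch only: split a synthesis job into fixed 1-second windows and yield
# each chunk as soon as it is ready, so the first byte can go out early.
SAMPLE_RATE = 24000      # Hz
BYTES_PER_SAMPLE = 2     # 16-bit PCM

def synthesize_window(start_sec, length_sec):
    """Hypothetical stand-in for running the engine on one window."""
    return b"\x00" * (int(SAMPLE_RATE * length_sec) * BYTES_PER_SAMPLE)

def stream_pcm(total_sec, window_sec=1.0):
    """Yield PCM chunks window by window instead of one final blob."""
    t = 0.0
    while t < total_sec:
        length = min(window_sec, total_sec - t)
        yield synthesize_window(t, length)
        t += length

chunks = list(stream_pcm(2.5))
# three chunks: 1.0 s, 1.0 s, and a final 0.5 s
```

In the real endpoint each chunk would come from the synthesis engine and be wrapped in a streaming HTTP response; the splitting logic above is the essence of why time-to-first-byte drops.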

Related issues

Screenshots / videos

Other

@Yosshi999 Yosshi999 requested a review from a team as a code owner March 1, 2025 16:55
@Yosshi999 Yosshi999 requested review from Hiroshiba and removed request for a team March 1, 2025 16:55
@Hiroshiba (Member) left a comment

Thanks for the implementation!!!

It would be easier to try out if you could also write down how to run it!!
(e.g. so we can check the behavior in the editor)

@@ -380,6 +380,51 @@ def multi_synthesis(
background=BackgroundTask(try_delete_file, f.name),
)

@router.post(
"/stream_synthesis",
@Hiroshiba (Member) commented

Ah, I tried working out what kind of signature would be good; it's written up here, so please use it as a reference!!
#1492 (comment)
(For example, returning wav rather than raw pcm seems better, which likely differs from this implementation.)

That said, I'm not at all confident this spec is right, so push-back is very welcome!!
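For reference, one common way to return wav rather than raw pcm when the total length is not known up front is to write the RIFF and data chunk sizes as 0xFFFFFFFF, which most decoders treat as "read until EOF". A hedged sketch of such a header (an illustration only, not necessarily the signature discussed in #1492):

```python
import struct

def streaming_wav_header(sample_rate=24000, channels=1, bits=16):
    """Build a 44-byte PCM WAV header with unknown (0xFFFFFFFF) sizes,
    suitable for prefixing a stream whose length is not known yet."""
    byte_rate = sample_rate * channels * bits // 8
    block_align = channels * bits // 8
    return (
        b"RIFF" + struct.pack("<I", 0xFFFFFFFF) + b"WAVE"
        # fmt chunk: size 16, format 1 (PCM), then the usual PCM fields
        + b"fmt " + struct.pack("<IHHIIHH", 16, 1, channels, sample_rate,
                                byte_rate, block_align, bits)
        + b"data" + struct.pack("<I", 0xFFFFFFFF)
    )

header = streaming_wav_header()
# 44 bytes: the same offset the client sample skips via blobPtr = 44
```

Browsers playing via Web Audio still need to decode the PCM manually (as the sample below does), but command-line players and `<audio>` implementations generally accept such headers.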

@Yosshi999 (Contributor, Author)

I'll switch to splitting the stream as wav.
The environment is 0.16.0-preview.0.

Calling this from the editor would take quite a bit of work, so I'll write a rough sample in JavaScript instead.

@Hiroshiba (Member)

Oh, yes, please do!!

@Yosshi999 (Contributor, Author)

Measurement results:

[normal] query: 16 ms, synthesis: 1132 ms
[stream] query: 13 ms, synthesis (first byte): 283 ms
<html>
    <head>
        <title>音声合成のテスト</title>
        <meta charset="utf-8" />
    </head>
    <body>
        <input
            type="text"
            id="input"
            value="これは音声合成のテストです。ストリーム版はレスポンスを受信しながら同時に再生します。"
            size="50"
        />
        <br />
        <button onclick="speak_normal()">normal</button>
        <button onclick="speak_stream()">stream</button>
        <br />
        <audio id="audio" controls></audio>
        <script>
            async function audio_query() {
                const input = document.getElementById("input").value;
                // encodeURIComponent keeps characters like "&" or "?" in the
                // text from breaking the query string
                const url = `http://localhost:50021/audio_query?speaker=1&text=${encodeURIComponent(input)}`;
                const response = await fetch(url, {
                    method: "POST",
                    headers: {
                        "Content-Type": "application/json",
                    }
                });
                return await response.text();
            }
            async function speak_normal() {
                const start = performance.now();
                const audio = document.getElementById("audio");
                const query = await audio_query();
                const end1 = performance.now();

                const url = `http://localhost:50021/synthesis?speaker=1`;
                const response = await fetch(url, {
                    method: "POST",
                    headers: {
                        "Content-Type": "application/json",
                    },
                    body: query,
                });
                const blob = await response.blob();
                const end2 = performance.now();
                console.log(`[normal] query: ${Math.round(end1 - start)} ms, synthesis: ${Math.round(end2 - end1)} ms`);

                audio.src = URL.createObjectURL(blob);
                audio.play();
            }

            async function speak_stream() {
                const start = performance.now();
                const audio = document.getElementById("audio");
                const audioContext = new AudioContext();
                const query = await audio_query();
                const end1 = performance.now();

                const url = `http://localhost:50021/stream_synthesis?speaker=1`;
                const response = await fetch(url, {
                    method: "POST",
                    headers: {
                        "Content-Type": "application/json",
                    },
                    body: query,
                });
                const reader = response.body.getReader();
                let recvLength = 0;
                let recvBuffer = new Uint8Array();
                let blobPtr = 44; // skip the 44-byte WAV header; PCM samples start after it
                let lastBufferedTime = 0;
                while (true) {
                    const { done, value } = await reader.read();
                    if (done) {
                        break;
                    }
                    recvLength += value.length;
                    const newBuffer = new Uint8Array(recvLength);
                    newBuffer.set(recvBuffer);
                    newBuffer.set(value, recvBuffer.length);
                    recvBuffer = newBuffer;

                    // decode every complete little-endian int16 sample received so far
                    if (recvLength >= blobPtr + 2) {
                        const recvView = new DataView(recvBuffer.buffer);
                        const nextPtr = recvLength - recvLength % 2;
                        const numFrames = (nextPtr - blobPtr) / 2;
                        const audioArrayBuffer = audioContext.createBuffer(1, numFrames, 24000);
                        const audioData = audioArrayBuffer.getChannelData(0);
                        for (let i = 0; i < numFrames; i++) {
                            const sample = recvView.getInt16(blobPtr + i * 2, true);
                            audioData[i] = sample / 32768;
                        }
                        const source = audioContext.createBufferSource();
                        source.buffer = audioArrayBuffer;
                        source.connect(audioContext.destination);
                        source.start(lastBufferedTime);
                        lastBufferedTime = Math.max(lastBufferedTime, audioContext.currentTime) + numFrames / 24000;

                        if (blobPtr === 44) {
                            // this is the first chunk
                            const end2 = performance.now();
                            console.log(`[stream] query: ${Math.round(end1 - start)} ms, synthesis (first byte): ${Math.round(end2 - end1)} ms`);
                        }
                        blobPtr = nextPtr;
                    }
                }
            }
        </script>
    </body>
</html>
