
Commit

Add longform narration
fakerybakery committed Jan 10, 2025
1 parent 80a0e3b commit 2481417
Showing 6 changed files with 85 additions and 2 deletions.
7 changes: 5 additions & 2 deletions README.md
@@ -14,6 +14,7 @@ A lightweight Python library for running TTS models with a unified API.
- 🎯 Focus on ease of use - a single API for all models
- 📦 Minimal dependencies - one package for all models
- 🔌 Extensible architecture - easily add new models
- 💎 Feature-rich - includes longform narration, voice cloning support, and more

## Models

@@ -33,11 +34,13 @@ pip install simpletts
## Quick Start

```python
import soundfile as sf
from simpletts.models.xtts import XTTS

tts = XTTS(device="auto")
# Note: XTTS is licensed under the CPML license which restricts commercial use.
# Easily swap out for F5-TTS:
# from simpletts.models.f5 import F5TTS
# tts = F5TTS(device="auto")

array, sr = tts.synthesize("Hello, world!", ref="sample.wav")

24 changes: 24 additions & 0 deletions docs/longform.md
@@ -0,0 +1,24 @@
# Longform Narration

SimpleTTS supports longform narration as an **experimental feature**, powered by [`txtsplit`](https://github.com/fakerybakery/txtsplit). Because it is experimental, it may not always work as expected, and output quality may vary.

## Example

Here is an example of how to use the `longform` method with Kokoro:

```python
from simpletts.models.kokoro import Kokoro
import soundfile as sf

# Initialize Kokoro model
tts = Kokoro(device="auto")

# Synthesize speech
text = """
Enter your longform text here...
"""
audio, sr = tts.longform(text, ref="af")

# Save output audio
sf.write("output.wav", audio, sr)
```
Binary file modified output.wav
Binary file not shown.
24 changes: 24 additions & 0 deletions sample/longform.py
@@ -0,0 +1,24 @@
from simpletts.models.kokoro import Kokoro
import soundfile as sf

# Initialize Kokoro model
tts = Kokoro(device="auto")

# Synthesize speech
text = """
Text-to-speech technology has come a long way in recent years, with many powerful models now available to developers. However, the fragmented ecosystem of TTS libraries poses significant challenges. Each model typically comes with its own unique API, dependencies, and setup requirements, making it difficult for developers to experiment with different models or switch between them as needed.
This is where a unified TTS library becomes invaluable. By providing a consistent interface across multiple models, it dramatically simplifies the development process. Developers can focus on their applications rather than wrestling with different APIs and dependencies for each model they want to try.
A unified library also promotes better code maintainability and portability. When your application's TTS functionality is abstracted behind a common interface, switching models becomes as simple as changing a single line of code. This flexibility is especially important as the field of TTS continues to evolve rapidly, with new and improved models being released regularly.
Additionally, a unified library can handle common tasks like text preprocessing, audio post-processing, and long-form text synthesis consistently across all models. This reduces duplication of effort and helps ensure consistent behavior regardless of the underlying model being used.
For organizations, having a unified TTS library means reduced training time for developers, simplified maintenance, and the ability to easily benchmark different models against each other. It also makes it easier to swap out models based on specific needs - whether that's quality, speed, licensing requirements, or language support.
In conclusion, as TTS technology becomes increasingly important in modern applications, having a unified library isn't just convenient - it's becoming essential for efficient development and maintenance of TTS-enabled applications.
"""
audio, sr = tts.longform(text, ref="af")

# Save output audio
sf.write("output.wav", audio, sr)
1 change: 1 addition & 0 deletions setup.py
@@ -31,6 +31,7 @@
"tqdm",
"openphonemizer",
"click",
"txtsplit",
],
extras_require={
"xtts": [
31 changes: 31 additions & 0 deletions simpletts/models/__init__.py
@@ -1,6 +1,8 @@
from abc import ABC, abstractmethod
import numpy as np
from typing import Tuple, Optional, Union
from txtsplit import txtsplit
from tqdm import tqdm


class TTSModel(ABC):
@@ -22,6 +24,35 @@ def synthesize(self, text: str, **kwargs) -> Tuple[np.ndarray, int]:
"""
pass

def longform(self, text: str, **kwargs) -> Tuple[np.ndarray, int]:
"""
Synthesize long text by splitting into chunks and concatenating results.
Args:
text: The text to synthesize
**kwargs: Additional arguments passed to synthesize()
Returns:
Tuple containing:
- Concatenated audio array as numpy array
- Sample rate as integer
"""
chunks = txtsplit(text)

# Synthesize each chunk
audio_chunks = []
sr = None
for chunk in tqdm(chunks):
audio, chunk_sr = self.synthesize(chunk, **kwargs)
if sr is None:
sr = chunk_sr
elif sr != chunk_sr:
raise ValueError("Inconsistent sample rates between chunks")
audio_chunks.append(audio)

# Concatenate audio chunks
return np.concatenate(audio_chunks), sr

@abstractmethod
def __init__(self, device: Optional[str] = "auto", **kwargs):
"""
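The chunk-and-concatenate logic in `longform` can be exercised without loading a real model. The sketch below substitutes a naive sentence splitter for `txtsplit` and a dummy `synthesize` that returns silence — both are stand-ins for illustration only, not part of the SimpleTTS API:

```python
import numpy as np

def naive_split(text):
    # Stand-in for txtsplit: split on sentence boundaries (illustration only)
    return [s.strip() + "." for s in text.split(".") if s.strip()]

def fake_synthesize(chunk, sr=24000):
    # Dummy model: returns 0.1 s of silence per chunk; a real model returns speech
    return np.zeros(int(0.1 * sr), dtype=np.float32), sr

def longform(text):
    # Same pattern as TTSModel.longform: split, synthesize per chunk,
    # verify sample rates agree, then concatenate
    audio_chunks = []
    sr = None
    for chunk in naive_split(text):
        audio, chunk_sr = fake_synthesize(chunk)
        if sr is None:
            sr = chunk_sr
        elif sr != chunk_sr:
            raise ValueError("Inconsistent sample rates between chunks")
        audio_chunks.append(audio)
    return np.concatenate(audio_chunks), sr

audio, sr = longform("First sentence. Second sentence. Third sentence.")
print(len(audio), sr)  # 3 chunks x 2400 samples = 7200 samples at 24000 Hz
```

Concatenating raw chunks like this can produce audible seams at chunk boundaries; inserting short silences or crossfading between chunks is a common refinement.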
