[Help]: Reconstruct Emilia dataset using Raw data #390

Open
3manifold opened this issue Jan 29, 2025 · 0 comments

Problem Overview

I plan to improve several Emilia pipeline components and reconstruct the Emilia dataset from the raw audio files. The documentation suggests that one can "download the raw audio files from the provided URL list":

(...)To reconstruct the Emilia dataset, you can download the raw audio files from the provided URL list and use our open-source Emilia-Pipe preprocessing pipeline to process the raw data and rebuild the dataset. Additionally, users can employ Emilia-Pipe to preprocess their own raw speech data to meet specific needs. (...)

However, downloading or scraping 101,000 hours of audio from sources similar to YouTube is a daunting challenge. Could you please clarify how to obtain direct access to the raw audio data?
