long-running tasks - solution overview #129

Merged · 1 commit · Jan 20, 2025
Binary file added guides/long-running-tasks/images/lrt-so1.png
Binary file added guides/long-running-tasks/images/lrt-so2.png
Binary file added guides/long-running-tasks/images/lrt-so3.png
167 changes: 167 additions & 0 deletions guides/long-running-tasks/solution-overview.mdx
@@ -0,0 +1,167 @@
---
title: 'Solution Overview'
description: 'Performing Long-Running Tasks on SaladCloud'
---

There are [two primary methods](/products/sce/getting-started/architectural-overview#provide-input-to-applications) to
provide inputs and retrieve results from applications running on SaladCloud:

- **SaladCloud’s Container Gateway:** The client application sends requests to the gateway, which forwards them to the
container instances, waits for their responses, and then relays the responses back to the client.
- **Job Queue System:** Salad Kelpie, Salad Job Queue, AWS SQS, or other queue solutions such as Redis, RabbitMQ, or
Kafka can be used to distribute jobs to instances integrated with a job queue client. The results are then delivered
via webhooks or retrieved through job queries.

The input and output data of a job can be directly embedded in the request/response and job/result. For larger datasets,
it is recommended to exchange data between the instances and cloud storage, while the gateway or job queue only carries
references to the data.

You can also create a **single-replica container group** for a job and use environment variables to specify the cloud
storage locations for the job's input and output. In this setup, the gateway or job queue is not required.
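
For illustration, here is a minimal sketch of this single-replica pattern. The environment variable names
(`INPUT_URL`, `OUTPUT_URL`) and the use of pre-signed URLs are assumptions for the example, not part of the SaladCloud
API:

```python
# Minimal sketch of a single-replica container group job, assuming the container
# group is created with hypothetical environment variables INPUT_URL and OUTPUT_URL
# that point to pre-signed cloud storage locations (names are illustrative only).
import os
import requests

def run_job(data: bytes) -> bytes:
    # Placeholder for the actual long-running workload.
    return data

def main():
    input_url = os.environ["INPUT_URL"]    # where to download the job's input
    output_url = os.environ["OUTPUT_URL"]  # where to upload the job's result

    data = requests.get(input_url, timeout=60).content   # download input
    result = run_job(data)                                # task-specific processing
    requests.put(output_url, data=result, timeout=600)    # upload output

if __name__ == "__main__":
    main()
```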

**Performing long-running tasks on SaladCloud involves the following key considerations:**

### Container Gateway or Job Queue

SaladCloud’s Container Gateway has a maximum
[server response timeout](/products/sce/gateway/load-balancer-options#server-response-timeout) of **100 seconds**. This
timeout specifies the maximum duration the gateway will wait for a response from an instance after forwarding a request.
If no response is received within this timeframe, the gateway returns a timeout error to the client. For jobs that may
run for longer than 100 seconds, such as video generation and long audio transcription, you will need to either use
webhooks to receive the results asynchronously while still using the gateway, or integrate with a job queue system for
better handling of long-running tasks.
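
As a rough illustration of the webhook approach, the sketch below accepts a request through the gateway, acknowledges
it immediately, and posts the result to a client-supplied callback URL once processing finishes. The request schema and
field names are assumptions for the example:

```python
# Minimal sketch of the webhook pattern behind the Container Gateway, assuming the
# client includes a callback URL in the request body (field names are illustrative).
from fastapi import FastAPI, BackgroundTasks
from pydantic import BaseModel
import requests

app = FastAPI()

class Job(BaseModel):
    payload: str
    webhook_url: str  # where the client wants the result delivered

def do_long_running_work(payload: str) -> str:
    return payload  # placeholder for the actual workload

def process_and_notify(job: Job) -> None:
    result = do_long_running_work(job.payload)  # may take well over 100 seconds
    requests.post(job.webhook_url, json={"result": result}, timeout=30)

@app.post("/jobs", status_code=202)
def submit_job(job: Job, background_tasks: BackgroundTasks):
    background_tasks.add_task(process_and_notify, job)
    return {"status": "accepted"}  # respond immediately, well within the 100-second limit
```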

### Concurrency and Throughput

The simplest approach while using the gateway or a job queue is to process one job at a time, ensuring that each job is
fully completed before the next one begins.

<img src="/guides/long-running-tasks/images/lrt-so1.png" />

The lifecycle of job execution may involve multiple stages, including downloading the input, pre-processing (e.g.,
format conversion, re-sampling and chunking), executing GPU inference, post-processing (e.g., merging and summarizing),
and uploading the output. Each stage has different resource demands—some are I/O-bound, others are CPU-intensive, and
some are GPU-intensive.

If the GPU inference constitutes only a small portion of the process, concurrent processing can significantly enhance
throughput and optimize resource utilization for these types of jobs.

Although multiprocessing- or multithreading-based concurrent inference over a single GPU makes effective use of CPU and
I/O capacity, it can limit optimal GPU cache utilization and degrade performance, because each inference runs in
isolation at its own layer or stage. The multiprocessing approach can also increase VRAM consumption, as each process
requires a separate CUDA context and loads its own copy of the model into GPU VRAM.

The recommendation is to use a multi-pipeline, staged architecture, where GPU inference remains sequential or batched
and is handled by a dedicated process or thread, while other tasks run in parallel or asynchronously across one or more
processes or threads to maximize efficiency. Typically, a local queue connects these two parts.

<img src="/guides/long-running-tasks/images/lrt-so2.png" />
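
A minimal thread-based sketch of this staged pipeline is shown below; the helper functions are placeholders for your
task-specific logic, and the queue size and thread count are illustrative values to tune for your workload:

```python
# Minimal sketch of the staged pipeline: several threads download and pre-process jobs,
# a single dedicated thread runs GPU inference, and local queues connect the stages.
import queue
import threading

pre_queue: "queue.Queue" = queue.Queue()            # raw jobs waiting to be prepared
gpu_queue: "queue.Queue" = queue.Queue(maxsize=8)   # prepared inputs waiting for the GPU

# placeholders for task-specific logic
def download_and_preprocess(job): return job
def run_inference(data): return data
def upload_result(result): pass

def preprocess_worker():
    while True:
        job = pre_queue.get()
        gpu_queue.put(download_and_preprocess(job))  # I/O- and CPU-bound, runs concurrently
        pre_queue.task_done()

def gpu_worker():
    while True:
        data = gpu_queue.get()
        upload_result(run_inference(data))           # inference stays sequential on one thread
        gpu_queue.task_done()

def start(num_preprocessors: int = 3):
    threading.Thread(target=gpu_worker, daemon=True).start()
    for _ in range(num_preprocessors):
        threading.Thread(target=preprocess_worker, daemon=True).start()
```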

Compared to single inference, batched inference may achieve higher throughput, but usually requires more VRAM. The
decision should be based on your specific use case and resource selection.

<img src="/guides/long-running-tasks/images/lrt-so3.png" />

[This repository](https://github.com/SaladTechnologies/mds/tree/main/inference-server) provides reference
implementations of inference servers that support concurrent processing, using either the gateway or a job queue (AWS
SQS).

To handle multiple jobs concurrently while using a job queue, AWS SQS is preferred, as it allows your application to
handle each job independently and flexibly, including pulling the job, extending its lease, and deleting it once
completed.
[In our YouTube video transcription benchmark test with AWS SQS](https://blog.salad.com/ai-batch-transcription-benchmark/),
which transcribed 1 million hours of video using 100 mid-range GPUs within a week, utilizing three or more threads for
downloading and pre-processing enabled us to achieve nearly 100% GPU utilization.
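
The sketch below outlines this per-job lifecycle with boto3: pulling a job, extending its visibility timeout (lease)
while work is in progress, and deleting it once completed. The queue URL and timeout values are placeholders:

```python
# Minimal sketch of the per-job lifecycle against AWS SQS using boto3.
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # placeholder

def pull_job():
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    messages = resp.get("Messages", [])
    return messages[0] if messages else None

def extend_lease(receipt_handle: str, seconds: int = 600) -> None:
    # Keep the job invisible to other instances while this one is still working on it.
    sqs.change_message_visibility(
        QueueUrl=QUEUE_URL, ReceiptHandle=receipt_handle, VisibilityTimeout=seconds
    )

def complete_job(receipt_handle: str) -> None:
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=receipt_handle)
```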

### Save Running State

SaladCloud operates on a foundation of distributed and interruptible nodes, meaning that a node hosting your instance
may go down unexpectedly at any time. However, the majority of nodes are capable of running continuously for over 10
hours. Please refer to [the benchmark test results](https://github.com/SaladTechnologies/mds/tree/main/demo-app3)
detailing uptime and the number of interruptions for jobs running over a 48-hour period on SaladCloud.

If an instance fails to complete (or delete) a job due to a node failure, the job will typically become available to
other instances via the job queue system. To avoid restarting an unfinished job from scratch after an
interruption—particularly for jobs expected to run longer than 30 minutes (depending on the use case), such as molecular
dynamics simulations—**consider implementing the following features in your applications (a minimal sketch follows the
list):**

- Start fresh while pulling a new job.
- Regularly save and upload the running state to cloud storage during job execution.
- Download and resume from the previous running state if retrieving an unfinished task.
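
A minimal sketch of this save-and-resume loop is shown below, assuming the cloud storage location of the running state
is known to the application (for example, passed along with the job); the helper functions and the saving interval are
illustrative:

```python
# Minimal sketch of checkpoint-based save/resume. The helper names and the
# checkpoint format are illustrative, not a specific SaladCloud API.
import time

SAVE_INTERVAL_SECONDS = 600  # tune against checkpoint size and available upload speed

# placeholders for task-specific logic and storage access
def download_state(job): return None
def initialize_state(job): ...
def advance(state): ...
def upload_state(job, state): ...
def upload_result(job, state): ...

def run_job(job):
    state = download_state(job)        # resume from the previous state if one exists
    if state is None:
        state = initialize_state(job)  # otherwise start fresh for a new job

    last_save = time.monotonic()
    while not state.done:
        state = advance(state)         # one unit of work (e.g., a simulation step)
        if time.monotonic() - last_save >= SAVE_INTERVAL_SECONDS:
            upload_state(job, state)   # regularly persist the running state
            last_save = time.monotonic()

    upload_result(job, state)
```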

Key application settings, such as saving intervals, upload data size, and required upload speed, should be carefully
tested and aligned:

- Longer saving intervals reduce the required upload speed but may result in greater GPU processing losses if nodes go
down.
- Shorter saving intervals require higher upload speeds, which may lead to upload backlogs.

Note that bandwidth on SaladCloud is often asymmetric, as many nodes are located in residential networks. Your
applications can perform
[initial checks to filter nodes](https://docs.salad.com/tutorials/high-performance-apps#perform-initial-checks-to-filter-nodes)
for specific download and upload speeds.

Consider using [open-source tools like Rclone](/tutorials/high-performance-storage-solutions) to optimize performance
and throughput. These tools support chunked, parallel uploads and downloads, enabling more efficient utilization of
bandwidth by establishing multiple simultaneous connections.
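
As an example, the sketch below shells out to Rclone to upload a checkpoint with chunked, parallel transfers; the
remote name and tuning flags are illustrative, so check the Rclone documentation for the options supported by your
version and storage backend:

```python
# Minimal sketch of invoking rclone from the application to upload a checkpoint
# with chunked, parallel transfers (values are illustrative).
import subprocess

def upload_checkpoint(local_path: str, remote_path: str = "s3remote:my-bucket/checkpoints") -> None:
    subprocess.run(
        [
            "rclone", "copy", local_path, remote_path,
            "--transfers", "8",              # parallel file transfers
            "--s3-chunk-size", "64M",        # chunked multipart uploads
            "--s3-upload-concurrency", "8",  # parallel chunks per file
        ],
        check=True,
    )
```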

Additionally, progressive or incremental uploads can reduce the overall required upload speed with shorter saving
intervals, while also allowing you to tap into a larger pool of available nodes.

Your applications can implement the above mechanism alongside any job queue system, such as Salad Kelpie, Salad Job
Queue and AWS SQS. Alternatively, you can create a single-replica container group for each job and integrate these
features, eliminating the need for a job queue.

**Since release 0.5.0, Salad Kelpie supports two use cases:**

- Use Kelpie as a job queue system along with its built-in data synchronization for easy implementation.
- Use Kelpie solely as a job queue while implementing custom, flexible data management strategies, including selective
downloads and progressive uploads.

### Super Long Running Jobs

Certain large molecular dynamics simulations, or model fine-tuning and hyperparameter tuning jobs, can run for several
days or even a week, generating tens to hundreds of gigabytes of data (e.g., trajectory files or checkpoints at various
time points) that must be uploaded to cloud storage. This is fully achievable on SaladCloud, as validated by customer
use cases. However, neither AWS SQS nor Salad Job Queue is ideal for this case.

Because AWS SQS imposes a maximum visibility timeout of 12 hours (measured from when the job is first received;
extending the timeout does not reset this limit), a long-running solution built on SQS can hold a job for at most 12
hours at a time. After this period, the job reappears in the queue and can be picked up by another instance to continue
execution.

Salad Job Queue can tolerate up to 3 node failures for a job. When more nodes fail during job execution, the job will be
reported as failed. For jobs running longer than a day, the likelihood of failure increases significantly.

For super long running jobs, we recommend using Salad Kelpie:

- **Simplified Architecture:** It significantly reduces application complexity by eliminating the need for job and
leasing management.
- **Enhanced Task Duration:** It allows unlimited node reallocations, enabling seamless support for longer running tasks
on SaladCloud.

If you have a small number of very long-running jobs, you can also create a single-replica container group for each job
and use environment variables to specify the locations of the job's input, state, and output. In this scenario, a job
queue system is unnecessary. **A job sits idle while its node is being reallocated, but our tests show that such
interruptions are minimal, accounting for less than 4% of the total time over a few-day period.**

### Summary

Here is a summary of the recommended solutions for each scenario:

| Job Duration | Use Case | Solution Description |
| :---------------- | :------------------------------------------------------------------------------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Under 100 seconds | Image generation, LLM streaming, real-time transcription | SaladCloud’s Container Gateway, or any job queue system. |
| Under 30 minutes | Video generation, LLM non-streaming, long audio transcription | Any job queue system. To handle multiple jobs concurrently while using a job queue, AWS SQS is preferred. |
| Under 12 hours | Molecular dynamics simulation, model fine-tuning | Any job queue system. Applications need to regularly save running state to cloud storage during job execution, and resume from the previous running state if retrieving an unfinished task. Use the Kelpie built-in data synchronization for easy implementation. |
| Over 12 hours | Large molecular dynamics simulation, model fine-tuning and hyperparameter tuning | Salad Kelpie is preferred for simplified architecture and enhanced task duration. For a small number of very long-running jobs, use the single-replica container group, eliminating the need for a job queue. |

Please refer to the example code and applications for further details:

| No | Description | Repository |
| :-- | :------------------------------------------------------------------------------------------------------------------------------------------------ | :------------------------------------------------------------------------------------------------------------------ |
| 1 | Reference implementation of inference server using SaladCloud’s Container Gateway, supporting concurrent processing with batched inference. | [link](https://github.com/SaladTechnologies/mds/tree/main/inference-server#use-the-container-gateway-load-balancer) |
| 2 | Reference implementation of inference server integrated with AWS SQS, supporting concurrent processing with single inference. | [link](https://github.com/SaladTechnologies/mds/tree/main/inference-server#use-a-job-queue) |
| 3 | Use Kelpie as the job queue along with its built-in data management. | [link](https://github.com/SaladTechnologies/mds/tree/main/demo-app1) |
| 4 | Use Kelpie solely as the job queue, while implementing flexible data management strategies, including selective downloads and progressive uploads | [link](https://github.com/SaladTechnologies/mds/tree/main/demo-app3) |
| 5 | Use AWS SQS as the job queue, while implementing flexible data management strategies. | [link](https://github.com/SaladTechnologies/mds/tree/main/demo-app2v2) |
4 changes: 4 additions & 0 deletions mint.json
@@ -454,6 +454,10 @@
{
"group": "Text To Speech",
"pages": ["guides/text-to-speech/metavoice-tts-api", "guides/text-to-speech/openvoice-api"]
},
{
"group": "Long-Running Tasks",
"pages": ["guides/long-running-tasks/solution-overview"]
}
]
}